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Abstract 

This paper is a multidisciplinary review of empirical, statistical learning from a graph- 
ical model perspective. Well-known examples of graphical models include Bayesian net- 
works, directed graphs representing a Markov chain, and undirected networks representing 
a Markov field. These graphical models are extended to model data analysis and empirical 
learning using the notation of plates. Graphical operations for simplifying and manipulat- 
ing a problem are provided including decomposition, differentiation, and the manipulation 
of probability models from the exponential family. Two standard algorithm schemas for 
learning are reviewed in a graphical framework: Gibbs sampling and the expectation max- 
imization algorithm. Using these operations and schemas, some popular algorithms can 
be synthesized from their graphical specification. This includes versions of linear regres- 
sion, techniques for feed-forward networks, and learning Gaussian and discrete Bayesian 
networks from data. The paper concludes by sketching some implications for data analysis 
and summarizing how some popular algorithms fall within the framework presented. 

The main original contributions here are the decomposition techniques and the demon- 
stration that graphical models provide a framework for understanding and developing com- 
plex learning algorithms. 

1. Introduction 

A probabilistic graphical model is graph where the nodes represent variables and the arcs 
(directed or undirected) represent dependencies between variables. They are used to define 
a mathematical form for the joint or conditional probability distribution between variables. 
Graphical models come in various forms: Bayesian networks used to represent causal and 
probabilistic processes, data-flow diagrams used to represent deterministic computation, 
influence diagrams used to represent decision processes, and undirected Markov networks 
(random flelds) used to represent correlation for images and hidden causes. 

Graphical models are used in domains such as diagnosis, probabilistic expert systems, 
and, more recently, in planning and control (Dean & Wellman, 1991; Chan & Shachter, 
1992), dynamic systems and time-series (KjaerufF, 1992; Dagum, Galper, Horvitz, & Seiver, 
1994), and general data analysis (Gilks et al., 1993a) and statistics (Whittaker, 1990). This 
paper shows the task of learning can also be modeled with graphical models. This meta- 
level use of graphical models was flrst suggested by Spiegelhalter and Lauritzen (1990) in 
the context of learning probabilities for Bayesian networks. 

Graphical models provide a representation for the decomposition of complex problems. 
They also have an associated set of mathematics and algorithms for their manipulation. 
When graphical models are discussed, both the graphical formalism and the associated 
algorithms and mathematics are implicitly included. In fact, the graphical formalism is 
unnecessary for the technical development of the approach, but its use conveys the important 
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structural information of a problem in a natural visual manner. Graphical operations 
manipulate the underlying structure of a problem unhindered by the fine detail of the 
connecting functional and distributional equations. This structuring process is important 
in the same way that a high-level programming language leads to higher productivity over 
assembly language. 

A graphical model can be developed to represent the basic prediction done by linear 
regression, a Bayesian network for an expert system, a hidden Markov model, or a con- 
nectionist feed-forward network (Buntine, 1994). A graphical model can also be used to 
represent and reason about the task of learning the parameters, weights, and structure of 
each of these representations. An extension of the standard graphical model that allows 
this kind of learning to be represented is used here. The extension is the notion of a plate 
introduced by Spiegelhalter^ (1993). Plates allow samples to be explicitly represented on 
the graphical model, and thus reasoned about and manipulated. This makes data analysis 
problems explicit in much the same way that utility and decision nodes are used for decision 
analysis problems (Shachter, 1986). 

This paper develops a framework in which the basic computational techniques for learn- 
ing can be directly applied to graphical models. This forms the basis of a computational 
theory of Bayesian learning using the language of graphical models. By a computational 
theory we mean that the approach shows how a wide variety of learning algorithms can 
be created from graphical specifications and a few simple algorithmic criteria. The basic 
computational techniques of probabilistic (Bayesian) inference used in this computational 
theory of learning are widely reviewed (Tanner, 1993; Press, 1989; Kass & Raftery, 1993; 
Neal, 1993; Bretthorst, 1994). These include various exact methods, Markov chain Monte 
Carlo methods such as Gibbs sampling, the expectation maximization (EM) algorithm, and 
the Laplace approximation. More specialized computational techniques also exist for han- 
dling missing values (Little & Rubin, 1987), making a batch algorithm incremental, and 
adapting an algorithm to handle large samples. With creative combination, these techniques 
are able to address a wide range of data analysis problems. 

The paper provides the blueprint for a software toolkit that can be used to construct 
many data analysis and learning algorithms based on a graphical specification. The con- 
ceptual architecture for such a toolkit is given in Figure 1. Probability and decision theory 
are used to decompose a problem into a computational prescription, and then search and 
optimization techniques are used to fill the prescription. A version of this toolkit already 
exists using Gibbs sampling as the general computational scheme (Gilks et al., 1993b). The 
list of algorithms that can be constructed in one form or another by the scheme in Fig- 
ure 1 is impressive. But the real gain from the scheme does not arise from the potential 
re-implementation of existing software, but from understanding gained by putting these in 
a common language, the ability to create novel hybrid algorithms, and the ability to tailor 
special purpose algorithms for specific problems. 

This paper is tutorial in the sense that it collects material from different communities and 
presents it in the language of graphical models. This paper introduces graphical models, 
to represent first-order inference and learning. Second, this paper develops and reviews 
a number of operations on graphical models. Finally, this paper gives some examples 

1. The notion of a "replicated node" was my version of this developed independently. I have adopted the 
notation of Spiegelhalter and colleagues for uniformity. 
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Figure 1: A software generator 



of developing learning algorithms using combinations of the basic operations. The main 
original contribution is the demonstration that graphical models provide a framework for 
understanding and developing complex learning algorithms. 
In more detail, the paper covers the following topics. 

Introduction to graphical models: Graphical models are used in two ways. 

Graphical models for representing inference tasks: Section 2 reviews some ba- 
sics of graphical models. Graphical models provide a means of representing pat- 
terns of inference. 

Adapting graphical representations to represent learning: Section 3 discusses 
the representation of the problem of learning within graphical models using the 
notion of a plate. 

Operations on graphical models: Operations take a graphical representation of a learn- 
ing problem and simplify it or perform an important calculation required to solve the 
problem. 

Operations using closed-form solutions: Section 4 covers those classes of learn- 
ing problems where closed-form solutions to learning are known. This section 
adapts standard statistical methods to graphical models. 

Other basic operations: Other operations on graphs are required to be able to 
handle more complex problems. These are covered in Section 6, including: 

Decomposition: Breaking a learning problem into independent components 

and evaluating each component. 
Differentiation: Computing the derivative of a probability or log probability 

with respect to variables on the graph. Differentiation can be decomposed 

into operations local to groups of nodes in the graph as is popular in neural 

networks. 

Some approximate operations: Some approximate algorithms follow naturally 
from the above methods. Section 7 reviews Gibbs sampling and its deterministic 
cousin, the EM algorithm. 
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Some example algorithms: The closed-form solutions to learning can sometimes be 
used to form a fast inner loop of more complex algorithms. Section 8 illustrates how 
graphical models help here. 

The conclusion lists some common algorithms and their derivation within the above frame- 
work. Proofs of lemmas and theorems are collected in Appendix A. 

2. Introduction to graphical models 

This section introduces graphical models. The brief tour is necessary before introducing 
the operations for learning. 

Graphical models offer a unified qualitative and quantitative framework for representing 
and reasoning with probabilities and independencies. They combine a representation for 
uncertain problems with techniques for performing inference. Flexible toolkits and systems 
exist for applying these techniques (Srinivas & Breese, 1990; Andersen, Olesen, Jensen, & 
Jensen, 1989; Cowell, 1992). Graphical models are based on the notion of independence, 
which is worth repeating here. 

Definition 2.1 A is independent of B given C if p(A,B\C) = p(A\C )p(B\C ) whenever 
p(C) ^ 0, for all A,B,C. 

The theory of independence as a basic tool for knowledge structuring is developed by Dawid 
(1979) and Pearl (1988). A graphical model can be equated with the set of probability distri- 
butions that satisfy its implied constraints. Two graphical models are equivalent probability 
models if their corresponding sets of satisfying probability distributions are equivalent. 

2.1 Directed graphical models 

The basic kind of graphical model is the Bayesian network, also called belief net, which 
is most popular in artificial intelligence. See Charniak (1991), Shachter and Heckerman 
(1987), and Pearl (1988) for an introduction. This is also a graphical representation for a 
Markov chain. A Bayesian network is a graphical model that uses directed arcs exclusively 
to form a directed acyclic graph (DAG), (i.e., a directed graph without directed cycles). 
Figure 2, adapted from (Shachter & Heckerman, 1987) shows a simple Bayesian network for 



a simplified medical problem. The graphical model represents a conditional decomposition 
of the joint probability (see (Lauritzen, Dawid, Larsen, & Leimer, 1990) for more details 
and interpretations). This decomposition works as follows (full variable names have been 




Figure 2: A simplified medical problem 
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abbreviated). 

p(Age,Occ,Clim,Dis,Symp\M) = (1) 
piAge\M) p(Occ\M) p(Clim\M) p(Dis\Age, Occ, Clim, M) p(Symp\Dis, M) , 

where M is the conditioning context, for instance the expert's prior knowledge and the 
choice of the graphical model in Figure 2. Each variable is written conditioned on its 
parents, where parents(x) is the set of variables with a directed arc into x. The general 
form for this equation for a set of variables X is: 

p(X|M) = Yi p(x\parents(x),M) . (2) 
xex 

This equation is the interpretation of a Bayesian network used in this paper. 
2.2 Undirected graphical models 

Another popular form of graphical model is an undirected graph, sometimes called a Markov 
network (Pearl, 1988). This is a graphical model for a Markov random field. Markov random 
fields became used in statistics with the advent of the Hammersley- Clifford theorem (Besag, 
York, & MoUie, 1991). A variant of the theorem is given later in Theorem 2.1. Markov 
random fields are used in imaging and spatial reasoning (Ripley, 1981; Geman & Geman, 
1984; Besag et al., 1991) and various stochastic models in neural networks (Hertz, Krogh, 
& Palmer, 1991). Undirected graphs are also important because they simplify the theory 
of Bayesian networks (Lauritzen et al., 1990). 

Figure 3 shows a simple 4x4 image and an undirected model for the image. This model 




Figure 3: A simple 4x4 mage and its graphical model 



is based on the first degree Markov assumption; that is, the current pixel is only directly 
infiuenced by pixels positioned next to it, as indicated by the undirected arcs between 
variables pij and Pij+i, Pij and Pi+ij, etc. Each node x (corresponding to a pixel) has 
its set of neighbors — those nodes it is directly connected to by an undirected arc. For 
instance, the neighbors of pi^i are pi,2, P2,2 and P2,i- For the variable/node x, denote these 
by neighbors(x). 

In general, there is no formula for undirected graphs in terms of conditional probabilities 
corresponding to Equation (1) for the Bayesian network of Figure 2. However, a functional 
decomposition does exist in another form based on the maximal cliques in Figure 3. Maximal 
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cliques are subgraphs that are fully connected but are not strictly contained in other fully 
connected subgraphs. These are the 9 sets of 2 X 2 cliques such as {pi, 2,^1,37^2, 27^2,3}- The 
interpretation of the graph is that the joint probability is a product over functions of the 
maximal cliques. 

. . .,P4,4) = (3) 

fl (Pl,l , Pl,2, P2,l5 P2,2) /2(Pl,2, Pl,3, P2,2, P2,3) /3(Pl,3, Pl,45 P2,3, P2,4) 
U{P2,1 , P2,2, P3,l5 P3,2) /5(P2,2, P2,3, P3,2, P3,3) /6(P2,3, ^2,4^ P3,3, P3,4) 
/7(P3,1,P3,2,P4,1,P4,2) /8(P3,2, P3,3, P4,2, P4,3) /9(P3,3, P3,4, P4,3, P4,4) , 

for some functions /i, . . . , /g defined up to a constant. From this formula it follows that pi^s 
is conditionally independent of its non-neighbors given its neighbors ^1,2, P2, 2,^2,37 Pi, 47^2,4- 
The general form for Equation (3) for a set of variables X is given in the next theorem. 
Compare this with Equation 2. 

Theorem 2.1 An undirected graph G is on variables in the set X . The set of maximal 
cliques on G is Cliques{G) C 2^ ■ The distribution p{X) (probability or probability density) 
is strictly positive in the domain Xxexdomain(x). Then under the distribution piX), x 
is independent of X — {x} — neighbors(x) given neighbors(x) for all x £ X (Frydenberg 
(1990) refers to this condition as local G-Markovian) if, and only if, p>{X) has the functional 
representation 

P{X) = n fciC) , (4) 

CeCliques(G) 

for some functions /c > 0. 

The general form of this theorem for finite discrete domains is called the Hammersley- 
Clifford Theorem (Geman, 1990; Besag et al., 1991). Again, this equation is used as the 
interpretation of a Markov network. 

2.3 Conditional probability models 

Consider the conditional probability p(Dis\Age, Dec, Glim) found in the simple medical 
problem from Figure 2 and Equation (1). This conditional probability models how the 
disease should vary for given values of age, occupation, and climate. Class probability trees 
(Breiman, Friedman, Olshen, & Stone, 1984; Quinlan, 1992), graphs and rules (Rivest, 
1987; Oliver, 1993; Kohavi, 1994), and feed-forward networks are representations devised to 
express conditional models in different ways. In statistics, the conditional distributions are 
also represented as regression models and generalized linear models (McCuUagh & Nelder, 
1989). 

The models of Figure 2 and Equation (1) and Figure 3 and Equation (3) show how 
the joint distribution is composed from simpler components. That is, they give a global 
model of variables in a problem. The conditional probability models, in contrast, give a 
model for a subset of variables conditioned on knowing the values of another subset. In 
diagnosis the concern may be a particular direction for reasoning, such as predicting the 
disease given patient details and symptoms, so the full joint model provides unnecessary 
detail. The full joint model may require extra parameters and thus more data to learn. In 
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supervised learning applications, the general view is that conditional models are superior 
unless prior knowledge dictates a full joint model is more appropriate. This distinction is 
sometimes referred to as the diagnostic versus the discriminant approach to classification 
(Dawid, 1976). 

There are a number of ways for explicitly representing conditional probability mod- 
els. Any joint distribution implicitly gives the conditional distribution for any subset of 
variables, by definition of conditional probability. For instance, if there is a model for 
p(Age,Occ,Clim,Dis, Symp), then by the definition of conditional probability a condi- 
tional model follows: 



p(Dis\Age, Occ, Clim, Symp) 



p(Age, Occ, Clim, Dis, Symp) 
Pi/^ge, Occ.Clim, Dis, Symp) 
(X p(Age,Occ,Clim,Dis, Symp) . 



Conditional distributions can also be represented by a single node that is labeled to identify 
which functional form the node takes. For instance, in the graphs to follow, labeled Gaussian 
nodes, linear nodes, and other standard forms are all used. Conditional models such as rule 
sets and feed-forward networks can be constructed by the use of special deterministic nodes. 
For instance. Figure 4 shows four model constructs. In each case, the input variables to the 




'Logistic 



- © 

\lJGaussian 



Figure 4: Graphical conditional models 

conditional models represented have been shaded. This shading means the values of these 
variables are known or given. In Figure 4(b), the shading indicates that the value for x is 
known, but the value for c is unknown. Presumably c will be predicted using x. Figure 4(a) 
represents a rule set. The nodes with double ovals are deterministic functions of their inputs 
in contrast to the usual nodes, which are probabilistic functions. This means that the value 
for rulci is a deterministic function of xi,. . . , a;„. Notice that this implies that the values 
for rulci and the others are known as well. The conditional probability for a deterministic 
node, as required for Equation (2), is treated as a delta function. Figure 4(c) corresponds 
to the statement: 



p(unit\xi, . 



unit = f{x\^...^Xn) 



if unit 



f(xi , . . . , Xji) , 

otherwise 



for some function / not specified. A Bayesian network constructed entirely of double ovals is 
equivalent to a data fiow graph where the inputs are shaded. The analysis of deterministic 
nodes in Bayesian networks and, more generally, in infiuence diagrams is considered by 
Shachter (1990). For some purposes, deterministic nodes are best treated as intermediate 
variables and removed from the problem. The method for doing this, variable elimination, 
is given later in Lemma 6.1. 
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The logical or conjunctive form of each rule in Figure 4(a) is not expressed in the graph, 
and presumably would be given in the formulas accompanying the graph, however the basic 
functional structure of the rule set exists. In Figure 4(b), a node has been labeled with its 
functional type. The functional type for this node with a Boolean variable c is the function, 

p(c = l\x) = — ~ = Sigmoid(x) = Logistic~^(x) (5) 

which maps a real value x onto a probability in (0, 1) for the binary variable c. This function 
is the inverse of the logistic or logit function used in generalized linear models (McCuUagh 
& Nelder, 1989), and is also the sigmoid function used in feed-forward neural networks. 
Figure 4(c) uses a deterministic node to reproduce a single unit from a connectionist feed- 
forward network, where the unit's activation is computed via a sigmoid function. Figure 4(d) 
is a simple univariate Gaussian, which makes y normally distributed with mean and 
standard deviation a. Here the node is labeled in italics to indicate its conditional type. 

At a more general level, networks can be conditional. Figure 5 shows two conditional 
versions of the simple medical problem. If the shading of nodes is ignored, the joint proba- 




Figure 5: Two equivalent conditional models of the medical problem 



bility, p(Age,Occ,Clim,Dis,Symp) for the two graphs (a) and (b) is: 

p(Age) p(Occ\Age) p(Clim\Age, Occ) p(Dis\Age, Occ, Clim) p(Symp\Age, Dis) 
p(Age) p(Occ) p(Clim) p(Dis\Age, Occ, Clim) p(Symp\Age, Dis) . 

However, because four of the five nodes are shaded, this means their values are known. The 
conditional distributions computed from the above are identical: 

p(Dis\Age, Occ, Clim, Synip) 

p(Dis\Age, Occ, Clim) p(Symp\Age, Dis) 
Y^Dis p(Dis\Age, Occ, Clim) p(Symp\Age, Dis) 

Why do these distinct graphs become identical when viewed from the conditional perspec- 
tive? Because conditional components of the model corresponding to age, occupation and 
climate cancel out when the conditional distribution is formed. However, the symptoms 
node has the unknown variable disease as a parent, so the arc from age to symptoms is 
kept. 

More generally, the following simple lemma applies and is derived directly from Equa- 
tion 2. 



166 



Learning with Graphical Models 



Lemma 2.1 Given a Bayesian network G with some nodes shaded representing a condi- 
tional probability distribution, if a node X and all its parents have their values given, then 
the Bayesian network G' created by deleting all the arcs into X represents an equivalent 
probability model to the Bayesian network G. 

This does not mean, for instance in the graphs just discussed, that there is no causal or 
influential links between the variables age, occupation, and climate, rather that their ef- 
fects become irrelevant in the conditional model considered because their values are already 
known. A corresponding result holds for undirected graphs, and follows directly from The- 
orem 2.1. 

Lemma 2.2 Given an undirected graph G with some nodes shaded representing a condi- 
tional probability distribution, delete an arc between nodes A and B if all their common 
neighbors are given. The resultant graph G' represents an equivalent probability model to 
the graph G. 

2.4 Mixed graphical models 

Undirected and directed graphs can also be mixed in a sequence. These mixed graphs are 
called chain graphs (Wermuth & Lauritzen, 1989; Frydenberg, 1990). These chain graphs 
are sometimes used here. However, a precise understanding of them is not required for this 
paper. A simple chain graph is given in Figure 6. In this case, the single disease node 




Figure 6: An expanded medical problem 



and single symptom node of Figure 2 are expanded to represent the case where there are 
two possibly co-occurring diseases and three possibly co-occurring symptoms. The medical 
specialist may have said something like: "Lung disease and heart disease can influence 
each other, or may have some hidden common cause; however, it is often difficult to tell 
which is the cause of which, if at all." In the causal model, join the two disease nodes 
by an undirected arc to represent direct influence. Likewise for the symptom nodes. The 
resultant joint distribution takes the form: 

p(Age,Occ,Clim, Heart-Dis, Lung-Dis, Symp-A, Symp-B , Symp-C) = (6) 
p(Age) p(Occ) p(Clim) p(Heart-Dis, Lung-Dis\Age, Occ, Clim) 
p(Symp-A, Symp-B, Symp-C\H eart-Dis, Lung-Dis) 
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where the last two conditional probabilities can take on an arbitrary form. Notice that the 
probabilities now have more than one variable on the left side. 

In general, a chain graph consists of a chain of undirected graphs connected by directed 
arcs. Any cycle through the graph cannot have directed arcs going in opposite directions. 
Chain graphs can be interpreted as Bayesian networks defined over the components of the 
chain instead of the original variables. This goes as follows: 

Definition 2.2 Given a subgraph G over some variables X , the chain components are sub- 
sets of X that are maximal undirected connected subgraphs in a chain graph G (Frydenberg, 
1990). Furthermore, let chain-components(A) denote all nodes in the same chain compo- 
nent as at least one variable in A. 

The chain components for the graph above, ordered consistently with the directed arcs, 
are {Age}, {Dec}, {Clim}, {Heart-Dis,Lung-Dis}, and {Symp-A, Symp-B, Symp-C}. 
Informally, a chain graph over variables X with chain components given by the set T 
is interpreted first as the decomposition corresponding to the decomposition of Bayesian 
networks in Equation (2): 

p(X\M) = Y[p(T\parents(T),M) (7) 

where 

parents[A) = parents[a) — A . 

aeA 

Each component probability p(T\parents(T), M ) has a form similar to Equation (4). 

Sometimes, to process graphs of this form without having to consider the mathematics 
of chain graphs, the following device is used. 

Comment 2.1 When a set of nodes U in a chain graph form a clique (a fully connected 
subgraph), and all have identical children and parents otherwise, then the set of nodes can 
be replaced a single node representing the cross product of the variables. 

This operation for Figure 6 is done to get Figure 7. Furthermore, chain graphs are sometimes 





Heart Disease &~~ 
Limg Disease 



Figure 7: An expanded medical problem 
used here where Bayesian networks can also be used. In this case: 

Comment 2.2 The chain components of a Bayesian network are the singleton sets of in- 
dividual variables in the graph. Furthermore, chain-components(A) = A. 
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Chain graphs can be decomposed into a chain of directed and undirected graphs. An 
example is given in Figure 8. Figure 8(a) shows the original chain graph. Figure 8(b) shows 



its directed and undirected components together with the Bayesian network on the right 
showing how they are pieced together. Having done this decomposition, the components 
are analyzed using all the machinery of directed and undirected graphs. The interpretation 
of these graphs in terms of independence statements and the implied functional form of 
the joint probability is a combination of the previous two forms given in Equation (2) 
and Theorem 2.1, based on (Frydenberg, 1990, Theorem 4.1), and on the interpretation of 
conditional graphical models in Section 2.3. 

3. Introduction to learning with graphical models 

A simplified inference problem is represented in Figure 9. Here, the nodes vari, var2 and 
var^ are shaded. This represents that the value of these nodes is given, so the inference task 
is to predict the value of the remaining variable class. This graph matches the so-called 
"idiot's" Bayes classifier (Duda & Hart, 1973; Langley, Iba, & Thompson, 1992) used in 
supervised learning for its speed and simplicity. The probabilities on this network are easily 
learned from data about the three input variables vari, var2 and var^, and class. This 
graph also matches an unsupervised learning problem where the class class is not in the 
data but is hidden. An unsupervised learning algorithm learns hidden classes (Cheeseman, 
Self, Kelly, Taylor, Freeman, & Stutz, 1988; McLachlan & Basford, 1988). 




Figure 8: Decomposing a chain graph 




Figure 9: A simple classification problem 
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The implied joint for these variables read from the graph is: 

p(class,vari,var2,var3) = p(class) p(vari\class) p(var2\class) p(var3\class) . (8) 

The Bayesian classifier gets its name because it is derived by applying Bayes theorem to 
this joint to get the conditional formula: 

p(class)p(vari\class) p(var2\class) pivarslclass) 

p{class\vari,var2,var3) = — — — — — — — — . (9) 

l^dass P(ciass)p(vari\class) p(var2\class) p(var3\class) 

The same formula is used to predict the hidden class for objects in the simple unsuper- 
vised learning framework. Again, this formula, and corresponding formula for more general 
classifiers, can be found automatically by using exact methods for inference on Bayesian 
networks. 

Consider the simple model given in Figure 9. If the matching unsupervised learning 
problem for this model was represented, a sample of N cases of the variables would be 
observed, with the first case being vari^i, var2^i, var^^i, and the iV-th case being vari^N, 
var2^N, var^^N. The corresponding hidden classes, classi to classy, would not be observed, 
but interest would be in performing inference about the parameters needed to specify the 
hidden classes. The learning problem is represented in Figure 10. This includes two added 
features: an explicit representation of the model parameters cj) and 9, and a representation 
of the sample as N repeated subgraphs. The parameter cj) (a vector of class probabilities) 




Figure 10: Learning the simple classification 



gives the proportions for the hidden classes, and the three parameters 9i, 02 and 9^ give 
how the variables are distributed within each hidden class. For instance, if there are 10 
classes, then is a vector of 10 class probabilities such that the prior probability of a case 
being in class c is (^c- If vari is a binary variable, then 9i would be 10 probabilities, one 
for each class, such that if the case is known to be in class c, then the probability vari is 
true is given by 9i^c and the probability vari is false is given by 1 — 9i^c- This yields the 
following equations: 

p(class = c|(^, M) = 7 
p(varj = true\class = c,9j,M) = 9j^c ■ 
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The unknown model parameters (j), 9i, 02 and 63 are included in the graphical model to 
explicitly represent all unknown variables and parameters in the learning problem. 

3.1 Introduction to Bayesian learning 

Now is a useful time to introduce the basic terminology of Bayesian learning theory. This is 
not an introduction to the field. Introductions are given in (Bretthorst, 1994; Press, 1989; 
Loredo, 1992; Bernardo & Smith, 1994; Cheeseman, 1990). This section reviews notions 
such as the sample likelihood and Bayes factor, important for subsequent results. 

For the above unsupervised learning problem there is the model, M, which is the use of 
the hidden class and the particular graphical structure of Figure 10. There are data assumed 
to be independently sampled, and there are the parameters of the model (cj), 9i, etc.). In 
order to use the theory, it must be assumed that the model is correct. That is, the "true" 
distribution for the data can be assumed to come from this model with some parameters. In 
practice, hopefully the model assumptions are sufficiently close to the truth. Different sets 
of model assumptions may be tried. Typically, the "true" model parameters are unknown, 
although there may be some rough idea about their values. Sometimes, several models are 
considered (for instance different kinds of Bayesian networks), but it is assumed that just 
one of them is correct. Model selection or model averaging methods are used to deal with 
them. 

For the Bayesian classifier above, a subjective probability is placed over the model 
parameters, in the form ^1, ^2, ^sl Af). This is called the prior probability. Bayesian 
statistics and decision theory is distinguished from all other statistical approaches in that 
it places initial probability distributions, the prior probability, over unknown model pa- 
rameters. If the model is a feed-forward neural network, then a prior probability needs to 
be placed over the network weights and the standard deviation of the error. If the model 
is linear regression with Gaussian error, then it is over the linear parameters 9 and the 
standard deviation of the error. Prior probabilities are an active area of research and are 
discussed in most introductions to Bayesian methods. 

The next important component is the sample likelihood, which, on the basis of the model 
assumptions M and given a set of parameters cj), 9i, ^2 and ^3, says how likely the sample of 
data was. This is p{sample\(j),9i,92,9'i, M). The model needs to completely determine the 
sample likelihood. The sample likelihood is the basis of the maximum likelihood principle 
and many hypothesis testing methods (Casella & Berger, 1990). This combines with the 
prior to form the posterior probability: 



This equation is Bayes theorem and the term p{sample\M) is derived from the prior and 
sample likelihood using an integration or sum that is often difficult to do: 



This term is called the evidence for model M , or model likelihood, and is the basis for 
most Bayesian model selection, model averaging methods, and Bayesian hypothesis testing 



p{4>, 9i, 92, 9'i\sample, M) 



p{sample\4>, 9i, ^2, ^3, M)p((^, 9i, 92, ^sjM) 
p(sample\M ) 




p(sample\(t>,9i,92,9s,M)p((t>,9i,92,9s\M) d(<^, ^1, ^2, ^3) • (10) 
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methods using Bayes factors (Smith & Spiegelhalter, 1980; Kass & Raftery, 1993). The 
Bayes factor is a relative quantity used to compare one model Mi with another M2: 



Bayes- factor(M2, Mi 



p{sample\M2) 
p(sample\Mi) 



Kass and Raftery (1993) review the large variety of methods available for computing or 
estimating the evidence for a model including numerical integration, importance sampling, 
and the Laplace approximation. In implementation, the log of the Bayes factor is used to 
keep the arithmetic within reasonable bounds. The log of the evidence can still produce 
large numbers, and since rounding errors in floating point arithmetic scales with the order 
of magnitude, the log Bayes factor is the preferred quantity to consider in implementation. 
The evidence is often simpler in mathematical analysis. The Bayes factor is the Bayesian 
equivalent to the likelihood ratio test used in orthodox statistics and developed by Wilks. 
See Casella and Berger (1990) for an introduction and Vuong (1989) for a recent review. 

The evidence and Bayes factors are fundamental to Bayesian methods. It is often the 
case that a complex "non-parametric" model (a statistical term that loosely translates as 
"many and varied parameter" model) be used for a problem, rather than a simple model 
with some fixed number of parameters. Examples of such models are decision trees, most 
neural networks, and Bayesian networks. For instance, suppose two models are proposed 
with Ml and M2 being two Bayesian networks suggested by the domain expert. These 
are given in Figure 11. They are over two multinomial variables vari and var2 and two 
Gaussian variables xi and X2. Model M2 has an additional arc going from the discrete 
variable var2 to the real valued variable xi. 



Model = M 




Model = M2 



Gnussian 



Gaussian 



N 




Figure 11: Two graphical models. Which should learning select? 



The parameters 01,02, l^i,(Ji, 1^2, for model Mi parameterize probability distributions 
for the first Bayesian network, and the parameters 0i,02, i^2t'^2 for the second. The 

task is to learn not only a set of parameters, but also to select a Bayesian network from the 
two. The Bayes factor gives the comparative worth of the two models. This simple example 
extends in principle to selecting a single decision tree, rule set, or Bayesian network from 
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the huge number available from attributes in the domain. In this case compare the posterior 
probabilities of the two models p(Mi\sample) and p(M2\sample). Assuming the truth falls 
in one or other model, the first is computed using Bayes theorem as: 

p(sample\Mi)p(Mi) 



p^Milsample) 



p(sample\Mi)p(Mi) + p(sample\M2)p(M2 
1 

I + Bayes-factor{M2,Miy^ ' 



More generally, when multiple models exist, it still holds that: 

p{M2\sample) s,y,,_ f,,,,,^M2, M,^^^^'^ 



p(Mi\sample) ' p{Mi) 

Notice that the computation requires of each model its prior and its evidence. The sec- 
ond form reduces the computation to the relative quantity being the Bayes factor, and 
a ratio of the priors. Bayesian hypothesis testing corresponds to checking if the Bayes 
factor of the null hypothesis compared to the alternative hypothesis is very small or very 
large. Bayesian model building corresponds to searching for a model with a high value of 
p(M\sample) oc p(sample\M)p(M), which usually involves comparing Bayes factors of this 
model with alternative models during the search. 

When making an estimate about a new case x, the estimate becomes: 

p(x\sample, {Mi, M2}) 

= p(Mi\sample) p(x\sample, Ml) + p(M2\sample) p(x\sample, M2) 

p(sample\Mi) p(Mi) p(x\sample, Mi) + p(sample\M2) p>{M2) p(x\sample, M2) 
p(sample\Mi) p(Mi) + p(sample\M2) p>{M2) 

The predictions of the individual models is averaged according to the model posteriors 
p){Mi\samp)le) and p(M2\sample) = 1 — p(Mi\sample). The general components used in 
this calculation are the model priors, the evidence for each model or the Bayes factors, and 
the prediction for the new case made for each model. 

This process of model averaging happens in general. A typical non-parametric problem 
would be to learn class probability trees from data. The number of class probability tree 
models is super-exponential in the number of features. Even when learning Bayesian net- 
works from data the number of Bayesian networks is at best quadratic in the number of 
features. Doing an exhaustive search of these spaces and doing the full averaging implied by 
the equation above is computationally infeasible in general. It may be the case that 15 mod- 
els have posterior probabilities p(M\sample) between 0.1 and 0.01, and several thousand 
more models have posteriors from 0.001 to 0.0000001. Rather than select a single model, a 
representative set of several models might be chosen and averaged using the identity: 

p^xlsample) = ^p(Mj|samp/e)p(a;|samp/e,Mj) . 

i 

The general averaging process is depicted in Figure 12 where a Gibbs sampler is used to 
generate a representative subset of models with high posterior. This kind of computation is 
done for class probability trees where representative sets of trees are found using a heuristic 
branch and bound algorithm (Buntine, 1991b), and for learning Bayesian networks (Madi- 
gan & Raftery, 1994). A sampling scheme for Bayesian networks is presented in Section 8.3. 
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Figure 12: Averaging over multiple Bayesian networks 



3.2 Plates: representing learning with graphical models 

In their current form, graphical models do not allow the convenient representation of a 
learning problem. There are four important points to be observed regarding the use of 
graphical models to improve their suitability for learning. Consider again the unsupervised 
learning system described in the introduction of Section 3. 

• The unknown model parameters ^i, d^, and ^3 are included in the graphical model to 
explicitly represent all variables in the problem, even model parameters. By including 
these in the probabilistic model, an explicitly Bayesian model is constructed. Every 
variable in a graphical model, even unknown model parameters, has a defined prior 
probability. 

• The learning sample is a repeated set of measured variables so the basic model of 
Figure 9 appears duplicated as many times as there are cases in the sample, as shown 
in Figure 10. Clearly, this awkward repetition will occur whenever homogeneous data 
is being modeled (typical in learning). Techniques for handling this repetition form a 
major part of this paper. 

• Neither graph in Figures 9 and 10 represents the goal of learning. For learning to be 
goal directed, additional information needs to be included in the graph: how is learned 
knowledge evaluated or how can subsequent performance be measured? This is the 
role of decision theory and it is modeled in graphical form using infiuence diagrams 
(Shachter, 1986). This is not discussed here, but is covered in (Buntine, 1994). 

• Finally, it must be possible to take a graphical representation of a learning problem 
and the goal of learning and construct an algorithm to solve the problem. Subsequent 
sections discuss techniques for this. 

Consider a simplified version of the same unsupervised problem. In fact, the simplest 
possible learning problem containing uncertainty goes as follows: there is a biased coin with 
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Figure 13: Tossing a coin: model without and with a plate 



an unknown bias for heads 9. That is, the long-run frequency of getting heads for this coin 
on a fair toss is 9. The coin is tossed N times and each time the binary variable headsi 
is recorded. The graphical model for this is in Figure 13(a). The headsi nodes are shaded 
because their values are given, but the 9 node is not. The 9 node has a Beta(1.5, 1.5) prior. 
This assumes 9 is distributed according to the Beta distribution with parameters cii = 1.5 
and Q!2 = 1.5, 

K^l«i,«2) = — (11) 

where Beta(, ) is the standard beta function given in many mathematical tables. This 
prior is plotted in Figure 14. A Beta(1.0, 1.0) prior, for instance, is uniform in 9, whereas 
Beta(1.5, 1.5) slightly favors values closer to 0.5 — a fairer coin. Figure 13(b) is an equivalent 




Figure 14: The Beta(1.5, 1.5) prior (a = 1.5) and other priors on 9 

graphical model using the notation of plates. The repeated group, in this case the headsi 
nodes, is replaced by a single node with a box around it. The box is referred to as a plate. 
and implies that 

• the enclosed subgraph is duplicated N times (into a "stack" of plates), 

• the enclosed variables are indexed, and 
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• any exterior-interior links are duplicated. 

In Section 2.1 it was shown that any Bayesian network has a corresponding form for 
the joint probability of variables in the Bayesian network. The same applies to plates. The 
plate indicates that a product (JJ) will appear in the corresponding form. The probability 
equation for Figure 10, read directly from the graph, is: 

01, 02, 03, classi, vari^i, var2^i, var^^i, . . . , classy, vari^N, var2^N, var^^N) = 
p{<j)) p{0i) p{^2) vi^'i) p{classi\(j)) p{vari^i\classi, 0i) p){var2^i\classi, 02) p(var3^i\classi, ^3) 
. . . p(class]\f\(l)) p(vari^]\f\class]\f, ^1) p(f ar2,jv|c/assjv, ^2) p(var3^]\f\class]\f, ^3) . 

The corresponding equation using product notation is: 

p((l),0i,02,03,class,,vari^„var2,i,var3^, : i=l,...,N) = p{(])) p{0i) p{02) p{03) 

N 

p{classi\(f)) p[vari^i\classi, 0i) p[var2^i\classi, ^2) p{'var3^i\classi, ^3) . 

8 = 1 

These two equations are equivalent. However, the differences in their written form corre- 
sponds to the differences in their graphical form. Each plate is converted into a product: 
the joint probability ignoring the plates is written, a product (JJ) is added for each plate 
to index the variables inside it. If there are two disjoint plates then there are two disjoint 
products. Overlapping plates yield overlapping products. The corresponding transforma- 
tion for the unsupervised learning problem of Figure 10 is given in Figure 15. Notice, the 




Figure 15: Simple unsupervised learning, with a plate 

hidden class variable is not shaded so it is not given. The corresponding transformation 
for the supervised learning problem of Figure 10, where the classes are given, and thus 
corresponds to the idiot's Bayes classifier, is identical to Figure 15 except that the class 
variable is shaded because the classes are now part of the training sample. 

Many learning problems can be similarly modeled with plates. Write down the graphical 
model for the full learning problem with only a single case provided. Put a box around the 
data part of the model, pull out the model parameters (for instance, the weights of the 
network or the classification parameters), and ensure they are unshaded because they are 
unknown. Now add the data set size (N) to the bottom left corner. 

The notion of a plate is formalized below. This formalization is included for use in 
subsequent proofs. 
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Definition 3.1 A chain graph G with plates on variable set X consists of a chain graph G' 
on variables X with additional boxes called plates placed around groups of variables. Only 
directed arcs can cross plate boundaries, and plates can be overlapping. Each plate P has 
an integer Np in the bottom left corner indicating its cardinality. Each plate indexes the 
variables inside it with values i = 1, . . . , Np. Each variable V £ X occurs in some subset of 
the plates. Let indval(V) denote the set of values for indices corresponding to these plates. 
That is, indval(V) is the cross product of index sets {1, . . .,iVp} for plates P containing 
V. 

A graph with plates can be expanded to remove the plates. Figure 10 is the expanded 
form of Figure 15. Given a chain graph with plates G on variables X, construct the expanded 
graph as follows: 

• For each variable V £ X , add a node for Vi for each i G indvar(V). 

• For each undirected arc between variables U and V, add an undirected arc between 
Ui and Vi for i G indvar(V) = indvar(U ). 

• For each directed arc between variables U and V, add a directed arc between Ui and 
Vj for i G indvar(V) and j G indvar(V) where i and j have identical values for index 
components from the same plate. 

The parents for indexed variables in a graph with plates are the parents in the expanded 
graph. 

parents{U) = parents{Ui) . 

i^indval[U) 

A graph with plates is interpreted using the following product form. If the product form 
for the chain graph G' without plates with chain components T is 

p(X\M(G')) = Y[p(T\parent.s(T),M) , 

then the product form for the chain graph G with plates has a product for each plate: 

p(X\M(Gj) = H n p(T,\parent.3(Ti),M) . (12) 

This is given by the expanded version of the graph. Testing for independence on chain 
graphs with plates involves expanding the plates. In some cases, this can be simplified. 

4. Exact operations on graphical models 

This section introduces basic inference methods on graphs without plates and exact inference 
methods on graphs with plates. While there are no common machine learning algorithms 
explained in this section, the operations explained are the mathematical basis of most fast 
learning algorithms. Therefore, the importance of these basic operations should not be 
underestimated. Their use within more well-known learning algorithms is explained in later 
sections. 
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Once a graphical model is developed to represent a problem, the graph can be manip- 
ulated using various exact or approximate transformations to simplify the problem. This 
section reviews basic exact transformations available: arc reversal, node removal, and exact 
removal of plates by recursive arc reversal. The summary of operations emphasizes the 
computational aspects. A graphical model has an associated set of definitions or tables for 
the basic functions and conditional probabilities implied by the graph, the operations given 
below effect both the graphical structure and these underlying mathematical specifications. 
In both cases, the process of making these transformations should be constructive so that 
a graphical specification for a learning problem can be converted into an algorithm. 

There are several generic approaches for performing inference on directed and undirected 
networks without plates. These approaches are mentioned, but will not be covered in 
detail. The first approach is exact and corresponds to removing independent or irrelevant 
information from the graph, then attempting to optimize an exact probabilistic computation 
by finding a reordering of the variables. The second approach to performing inference is 
approximate and corresponds to approximate algorithms such as Gibbs sampling, and other 
Markov chain Monte Carlo methods (Hrycej, 1990; Hertz et al., 1991; Neal, 1993). In 
some cases, the complexity of the first approach is inherently exponential in the number 
of variables, so the second can be more efficient. The two approaches can be combined in 
some cases after appropriate reformulation of the problem (Dagum & Horvitz, 1992). 

4.1 Exact inference without plates 

The exact inference approach has been highly refined for the case where all variables are 
discrete. It is not surprising that available algorithms have strong similarities (Shachter, 
Andersen, & Szolovits, 1994) since the major choice points involve the ordering of the 
summation and whether this ordering is selected dynamically or statically. Other special 
classes of inference algorithms include the cases where the model is a multivariate Gaussian 
(Shachter & Kenley, 1989; Whittaker, 1990), or corresponds to some specific diagnostic 
structure, such as two-level believe networks with a level of symptoms connected to a level 
of diseases (Henrion, 1990). This subsection reviews some simple, exact transformations on 
graphical models without plates. Two representative methods are covered but are by no 
means optimal: arc reversal and arc removal. They are important, however, because they 
are the building blocks on which methods for graphs with plates are based. Many more 
sophisticated variations and combinations of these algorithms exist in the literature, includ- 
ing the handling of deterministic nodes (Shachter, 1990) and chain graphs and undirected 
graphs (Frydenberg, 1990). 

4.1.1 Arc reversal 

Two basic steps for inference are to marginalize nuisance parameters or to condition on new 
evidence. This may require evaluating probability variables in a different order. The arc 
reversal operator interchanges the order of two nodes connected by a directed arc (Shachter, 
1986). This operator corresponds to Bayes theorem and is used, for instance, to automate 
the derivation of Equation (9) from Equation (8). The operator applies to directed acyclic 
graphs and to chain graphs where a and b are adjacent chain components. 
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Figure 16: Arc reversal: reversing nodes a and b 



Consider a fragment of a graphical model as given in the left of Figure 16. The equation 
for this fragment is: 

p(a,b\A) = p(b\parents(b)) p(a\b,parents(a)) . 

Suppose nodes a and b need to be reordered. Assume that between a and b there is no 
directed path of length greater than one. If there was, then reversing the arc between a 
and b would create a cycle, which is forbidden in a Bayesian network and chain graph. The 
formula for the variable reordering can be found by applying Bayes theorem to the above 
equation. 

p(^a\A) = ''^p[b\parents[b)) p[a\parents[a)) , 

b 

p(b\parents(b)) p(a\parents(a)) 
' J2b p(b\parents(b)) p(a\parents(a)) 

The corresponding graph is given in the right of Figure 16. Notice that the effect on the 
graph is that nodes for a and b now share their parents. This is an important point. If all 
of a's parents were also 5's, and vice versa, excepting b (parents(a) = parents(b) U {b}), 
then the graph would be unchanged except for the direction of the arc between a and b. 
Regardless, the probability tables or formula associated with the graph also need to be 
updated. If the variables are discrete and full conditional probability tables are maintained, 
then this operation requires instantiating the set {a, b}[Jparents(a)[Jparents(b) in all ways, 
which is exponential in the number of variables. 

4.1.2 Arc and node removal 

Some variables in a graph are part of the model, but are not important for the goal of 
the data analysis. These are called nuisance parameters. An unshaded node y without 
children (no outward going arcs) that is neither an action node nor a utility node can 
always be removed from a Bayesian network. This corresponds to leaving out the term 
p(y\parents(y)) in the product of Equation 2. Given 

p(a,b,y) = p(a)p(b\a)p(y\a,b) 

then y can be marginalized out trivially to yield: 

p(a,b) = p(a) p(b\a) . 
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More generally this applies to chain graphs — a chain component whose nodes are all un- 
shaded and have no children can be removed. If y is a node without children, then remove 
the node with y from the graph and the arcs to it; ignore the factor p(y\parents(y)) in the 
full joint form. Consider that the i-th case in Figure 10 (nodes classi, vari^i, var2^i, var^^i) 
can be removed from the model without affecting the rest of the graph. 

4.2 Removal of plates by exact methods 

Consider the simple coins problem of Figure 13 again. The graph represents the joint prob- 
ability for p(9, headsi, . . . , heads]\f ). The main question of interest here is the conditional 
probability of 9 given the data headsi, . . . ,heads]\f. This could be obtained through re- 
peated arc reversals between 9 and headsi, then between 9 and heads2, and so on, until all 
the data appears before 9 in the directed graph. Doing this repeated series of applications 
of Bayes theorem yields a fully connected graph with (N + l)N/2 arcs. The corresponding 
formula for the posterior simplified with Lemma 2.1 is also simple: 

0ai-l+p n — 0\"2-'i-+n 

p(^|/iea(isi, . . . , /learfsjv, cii = 1-5, Q!2 = 1.5) = — — (13) 

Beta( ai -\- p,a2 -\- n) 

where p is the number of heads in the sequence and n = N — p is the number of tails. 
This is a worthwhile introductory exercise in Bayesian decision theory (Howard, 1970) that 
should be familiar to most students of statistics. Compare this with Equation (11). There 
are several important points to notice about this result: 

• Effectively, this does a parameter update, a[ = ai + p and a2 = a2 + n, requiring 
no search or numerical optimization. The whole sequence of tosses, irrespective of its 
length and the ordering of the heads and tails, can be summed up with two numbers. 
These summary statistics are called sufficient statistics because, assuming the model 
used is correct, they are sufficient to explain all that is important about 9 in the data,. 

• The corresponding graph can be simplified as shown in Figure 17. The plate is effi- 
ciently removed and replaced by the sufficient statistics (two numbers) irrespective of 
the size of the sample. 



(S) ►© 

Beta(1.5+n,1.5+p) 

Figure 17: Removing the plate in the coin problem 



• The posterior distribution has a simple form. Furthermore, all the moments of 9, 
log9, and log(l — 9) for the distribution can be computed as simple functions of the 
normalizing constant, Beta(a[, a2). For instance: 

f nn^o\ - dlog Beta(a[,a'2) 

^9\headsi ,...,headsjq ,ai ,02 ^ ) 



_ Beta{a[ + l,^^^) 
^e\heads^,...,headsj,,a^,aA^) " Betaia'^.a'^) 
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This result might seem somewhat obscure, but it is a general property holding for 
a large class of distributions that allows some averages to be calculated by symbolic 
manipulation of the normalizing constant. 

4.3 The exponential family 

This result generalizes to a much larger class of distributions referred to as the exponential 
family (Casella & Berger, 1990; DeGroot, 1970). This includes standard undergraduate 
distributions such as Gaussians, Chi squared, and Gamma, and many more complex distri- 
butions constructed from simple components including class probability trees over discrete 
input domains (Buntine, 1991b), simple discrete and Gaussian versions of a Bayesian net- 
work (Whittaker, 1990), and linear regression with a Gaussian error. Thus, these are a 
broad and not insignificant class of distributions that are given in the definition below. 
Their general form has a linear combination of parameters and data in the exponential. 

Definition 4.1 A space X is independent of the parameter 9 if the space remains the same 
when just 9 is changed. If the domains of x,y are independent of 9, then the conditional 
distribution for x given y, p(x\y,9, M), is in the exponential family when 

p{x\y,9,M) = ^^^exp(^j2w,{9Mx,y)^ (14) 

for some functions Wi, ti, h and Z and some integer k, for h(x, y) > 0. The normalization 
constant Z{9) is known as the partition function. 

Notice the functional form of Equation (14) is similar to the functional form for an undi- 
rected graph of Equation (4), as holds in many cases for a Markov random field. For the 
previous coin tossing example, both the coin tossing distribution (a binomial on headsi) and 
the posterior distribution on the model parameters {9) are in the exponential family. To see 
this, notice the following rewrites of the original probabilities. These make the components 
Wi, ti and Z explicit. 

p{heads\9) = exp (lheads=true log9 + lheads=false log(l - 9)) , 

p(9\headsi, . . . ,heads]\f,ai,a2) = 

— — ^ — exp ((ai +p-l)log9 + {a2 + n-l) log(l - 9)) . 

Beta[ai -\- p,a2-\- n) 

Table 2 in Appendix B gives a selection of distributions, and their functional form. Fur- 
ther details can be found in most textbooks on probability distributions (DeGroot, 1970; 
Bernardo & Smith, 1994). 

The following is a simple graphical reinterpretation of the Pitman-Koopman Theorem 
from statistics (Jeffreys, 1961; DeGroot, 1970). In Figure 18(a), T{x^,y^) is a statistic 
of fixed dimension independent of the sample size N (corresponding to ra,p in the coin 
tossing example). The theorem says that the sample in Figure 18(a) can be summarized in 
statistics, as shown in Figure 18(b), if and only if the probability distribution for x\y,9 is 
in the exponential family. In this case, T{x^,y^) is a sufficient statistic. 
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(a) 



(b) 



N 





Figure 18: The generalized graph for plate removal 



Theorem 4.1 (Recursive arc-reversal). Consider the model M represented by the graphical 
model for a sample of size N given in Figure 18(a). Have x in the domain X and y in the 
domain Y , both domains are independent of 9, and both domains have components that are 
real valued or finite discrete. Let the conditional distribution for x given y,9 be f(x\y,9), 
which is positive for all x G X. If first derivatives exist with respect to all real valued 
components of x and y, the plate removal operation applies for all samples x^, = xi, . . . , x^, 
y* = Vit- ■ • 7 yjV; (i^d 9, as given in Figure 18(b), for some sufficient statistics T{x^,y^) of 
dimension independent of N if and only if the conditional distribution for x given y, 9 is 
in the exponential family, given by Equation (14). In this case, T{x^,y^) is an invertible 
function of the k averages: 



In some cases, this extends to domains X and Y dependent on 9 (Jeffreys, 1961). 

Sufficient statistics for a distribution from the exponential family are easily read from 
the functional form by taking a logarithm. For instance, for the multivariate Gaussian, the 
sufficient statistics are Xi for i = 1, . . .,6? and XiXj for < i < j < d, and the normalizing 
constant Z(p, S) is given by: 



As for coin tossing, it generally holds that if a sampling distribution (a binomial on 
headsi) is in the exponential family, then the posterior distribution for the model parameters 
{9) can also be cast as exponential family. This is only useful when the normalizing constant 
(this is Beta(ai + p, Q!2 + n) in the coin tossing example) and its derivatives are readily 
computed. 

Lemma 4.1 (The conjugacy property). In the context in Theorem 4-1, assume the distri- 
bution for X given y,9 can be represented by the exponential family. Factor the normalizing 
constant Z{9) into two components, Z{9) = Zi(9)Z2, where the second is the constant part 
independent of 9. Assume the prior on 9 takes the form: 



1 



N 



N 



y ] ^ii^j) '■ i — 1, . . . , k . 




p(9\t,M) 



Mr) 



exp rfc+i(log 1/Zi(^)) + r^w^{9) 



•) 



(15) 
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for some k + l dimensional parameter t, where Zg{T) is the appropriate normalizing constant 
and f{9) is any function. Then the posterior distribution for 9, p(^|r, si, . . . , sjv, Af), is 
also represented by Equation (15) with the parameters 

Tfc+i = Tfc+l + N 

N 

^' = '^^ + i = l,...,k. 

When the function f(9) is trivial, for instance uniformly equal to 1, then the distribution in 
Equation (15) is referred to as the conjugate distribution, which means it has a mathematical 
form mirroring that of the sample likelihood. The prior parameters r, by looking at the 
update equations in the lemma, can be thought of as corresponding to the sufficient statistics 
from some "prior sample" and given by r^+i. 

This property is useful for analytic and computational purposes. Once the posterior 
distribution is found, and assuming it is one of the standard distributions, the property can 
easily be established. Table 3 in Appendix B gives some standard conjugate prior distri- 
butions for those in Table 2, and Table 4 gives their matching posteriors. More extensive 
summaries of this are given by DeGroot (1970) and Bernardo and Smith (1994). The pa- 
rameters for these priors can be set using standard reference priors (Box & Tiao, 1973; 
Bernardo & Smith, 1994) or elicited from a domain expert. 

There are several other important consequences of the Pitman-Koopman Theorem or 
recursive arc reversal that should not go unnoticed. 

Comment 4.1 Ifx,y are discrete and finite valued, then the distribution p(x\y, 9) can be 
represented as a member of the exponential family. This holds because a positive finite 
discrete distribution can always be represented as an extended case statement in the form 

p(x\y,9) = exp ^ h,{x,y)M.0)] 
\i=i,k / 

where the boolean functions ti(x,y) are a set of mutually exclusive and exhaustive conditions. 
The indicator function 1a has the value 1 if the boolean A is true and otherwise. The 
main importance of the exponential family is in continuous or integer domains. Of course, 
since a large class of functions log p(x\y, 9) can always be approximated arbitrarily well by a 
polynomial in x,y and 9 with sufficiently many terms, the exponential family covers a broad 
class of distributions. 

The application of the exponential family to learning is perhaps the earliest published 
result on computational learning theory. The following two interpretations of the recursive 
arc reversal theorem are relevant mainly for distributions involving continuous variables. 

Comment 4.2 An incremental learning algorithm with finite memory must compress the 
information it has seen so far in the training sample into a smaller set of statistics. This 
can only be done without sacrificing information in the sample, in a context where all 
probabilities are positive, if the hypothesis or search space of learning is a distribution from 
the exponential family. 
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Comment 4.3 The computational requirements for learning an exponential family distri- 
bution are guaranteed to be linear in the sample size: first compute the sufficient statistics 
and then learning proceeds independently of the sample size. This could be exponential in 
the dimension of the feature space, however. 

Furthermore, in the case where the functions Wi are full rank in 9 (dimension of 9 is k, 

same as w, and the Jacobian of w with respect to 9 is invertible, det ^ ^^^P ^ various 

moments of the distribution can easily be found. For this situation, the function , when 
it exists, is called the link function (McCuUagh & Nelder, 1989). 

Lemma 4.2 Consider the notation of Definition 4-1- If the link function for an ex- 
ponential family distribution exists, then moments of functions ofti(x,y) and exp (ti(x,y)) 
can be expressed in terms of derivatives and direct applications of the functions Z , ti, Wi, 
and . If the normalizing constant Z and the link function are in closed form, then so 
will the moments. 

Techniques for doing these symbolic calculations are given in Appendix B. Exponential 
family distributions then fall into two groups. There are those where the normalizing 
constant and link function are known, such as the Gaussian. One can efficiently compute 
their moments and determine the functional form of their conjugate distributions up to 
the normalizing constant. For others this is not the case. For others, such as a Markov 
random field used in image processing, moments can generally only be computed by an 
approximation process like Gibbs sampling given in Section 7.1. 

4.4 Linear regression: an example 

As an example, consider the problem of linear regression with Gaussian error described in 
Figure 19. This is an instance of a generalized linear model and has a linear construction 




Figure 19: Linear regression with Gaussian error 
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at its core. The M basis functions are known deterministic functions of the input vari- 
ables si, . . . , Xn- These would typically be nonlinear orthogonal functions such as Legendre 
polynomials. These combine linearly with the parameters 9 to produce the mean m for the 
Gaussian. 

The corresponding learning problem represented as plates is expressed in Figure 20. The 




Gaussian 



Inverse Gamma 



Figure 20: The linear regression problem 
joint probability for this model is as follows: 

1 Ylf=i {y^-Y.^=i OjbasiSj(x^^,)^ 



p{a)piOW) 



2Traj 



,N' 



where x^^i denotes the vector of values for the i-th datum, xi^i, . . .,Xn,i- Linear regression 
with Gaussian error falls in the exponential family because a Gaussian is in the exponential 
family and the mean of the simple Gaussian is a linear function of the regression parameters 
(see Lemma 5.1). 

For this case, the correspondence to the exponential family is drawn as follows. The 
individual data likelihoods, p(y\xi, . . . , Xn, 0, a), need only be considered. Expand the prob- 
ability to show it is a linear sum of data terms and parameter terms. 



p{y\xi, . . .,; 
1 
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— I y -J^basisjix, 
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exp I —'^~^y^ — ^ basis j{x )hasisk{x ) 



2(72 



M 



Y,basisj{x)y^ 



The data likelihood in this last line can be seen to be in the same form as the general expo- 
nential family where the sufficient statistics are the various data terms in the exponential. 
Also, the link function does not exist because there are M parameters and M(M + l)/2 
sufficient statistics. 
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Gaussian Inverse Gamma 

Figure 21: The linear regression problem with the plate removed 

The model of Figure 20 can therefore be simplified to the graph in Figure 21, where 
q and S are the usual sample means and covariances obtained from the so-called normal 
equations of linear regression. 5 is a matrix of dimension M (the number of basis functions) 
and g is a vector of dimension M . 

ysq 

These three sufficient statistics can be read directly from the data likelihood above. 
Consider the formula: 

Differentiating with respect to 9i shows that the expected value of y given si, . . and 
9, a is the mean (as expected): 

M 

£y\xt,...,xn,e,a{y) = m = Y^hasisj{x)ej . 

1=1 

Differentiating with respect to a shows that the expected error from the mean is (again 
expected): 




Higher-order derivatives give formula for higher-order moments such as skewness and kur- 
tosis (Casella & Berger, 1990), which are functions of the second, third and fourth central 
moments. While these are well known for the Gaussian, the interesting point is that these 
formula are constructed by differentiating the component functions in Equation (14) with- 
out recourse to integration. Finally, the conjugate distribution for the parameters 9 in this 
linear regression problem is the multivariate Gaussian distribution when a is known, and 
for is the inverted Gamma. 
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5. Recognizing and using the exponential family 

How can recursive arc reversal be applied automatically to a graphical model? First, when a 
graphical model or some subset of a graphical model falls in the exponential family needs to 
be identified. If each conditional distribution in a Bayesian network or chain component in 
a chain graph is exponential family, then the full joint is exponential family. The following 
lemma gives this with some additional conditions for deterministic nodes. This applies to 
Bayesian networks using Comment 2.2. 

Lemma 5.1 A chain graph has a single plate. Let the non-deterministic variables inside 
the plate be X , and the deterministic variables be Y . Let the variables outside the plate be 
e. If: 

L All arcs crossing a plate boundary are directed into the plate. 

2. For all chain components t, the conditional distribution p(T\parents(T)) is from the 
exponential family with data variables from (X, Y) and model parameters from 9; 
furthermore log p(T\parents(T), 9) is a polynomial function of variables in Y . 

3. Each variable y (zY can be expressed as a deterministic function of the form 

y = J2u,iX)v,i9) 

8 = 1 

for some functions Ui , Vi . 
Then the conditional distribution p{X,Y\9) is from the exponential family. 

Second, how can these results be used when the model does not fall in the exponential 
family? There are two categories of techniques available in this context. In both cases, 
the algorithms concerned can be constructed from the graphical specifications. The two 
new classes together with the recursive arc reversal case are given in Figure 22. In each 
case, (I) denotes the graphical configuration and (II) denotes the operations and simplifica- 
tions performed by the algorithm. When the various normalization constants are known in 
closed form and appropriate moments and Bayes factors can be computed quickly, all three 
algorithm schemas have reasonable computational properties. 

The first category is where a useful subset of the model does fall into the exponential 
family. This is represented by the partial exponential family in Figure 22. The part of the 
problem that is exponential family is simplified using the recursive arc reversal of Theo- 
rem 4.1, and the remaining part of the problem is typically handled approximately. Decision 
trees and Bayesian networks over multinomial or Gaussian variables also fall into this cate- 
gory. This happens because when the structure of the tree or Bayesian network is given the 
remaining problem is composed of a product of multinomials or Gaussians. This is the basis 
of various Bayesian algorithms developed for these problems (Buntine, 1991b; Madigan & 
Raftery, 1994; Buntine, 1991c; Spiegelhalter, Dawid, Lauritzen, & Cowell, 1993; Hecker- 
man, Geiger, & Chickering, 1994). Strictly speaking, decision trees and Bayesian networks 
over multinomial or Gaussian variables are in the exponential family (see Comment 4.1). 
However, it is more computationally convenient to treat them this way. This category is 
discussed more in Section 8. 
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Figure 22: Three categories of algorithms using the exponential family 



The second category is where, if some hidden variables are introduced into the data, the 
problem becomes exponential family if the hidden values are known. This is represented 
by the mixture model in Figure 22. Mixture models (Titterington, Smith, & Makov, 1985; 
Poland, 1994) are used to model unsupervised learning, incomplete data in the classifica- 
tion problems, robust regression, and general density estimation. Mixture models extend 
the exponential family to a rich class of distributions, so this second category is an impor- 
tant one in practice. General methods for handling these problems correspond to Gibbs 
sampling (and other Markov chain Monte Carlo methods) discussed in Section 7.2 and its 
deterministic counterpart the expectation maximization algorithm, discussed in Section 7.4. 
As shown in Figure 22, these algorithms cycle back and forth between a process that re- 
estimates c given 9 using first-order inference and a process that uses the fast exponential 
family algorithms to re-estimate 9 given c. 

6. Other operations on graphical models 

The recursive arc reversal theorem of Section 4.2 characterizes when plates can be readily 
removed and the sample summarized in some statistics. Outside of these cases, more general 
classes of approximate algorithms exist. Several of these are introduced in Section 7 and 
more detail is given, for instance, by Tanner (1993). These more general algorithms require 
a number of basic operations be performed on graphs: 

Decomposition: A learning problem can sometimes be decomposed into simpler sub- 
problems with each yielding to separate analysis. One form of decomposition of 
learning problems is considered in Section 6.2. Another related form that applies 
to undirected graphs is developed by Dawid and Lauritzen (1993). Other forms of de- 
composition can be done at the modeling level, where the initial model is constructed 
in a manner requiring fewer parameters, as is Heckerman's similarity networks (1991). 
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Exact Bayes factors: Model selection and averaging methods are used to deal with 
multiple models (Kass & Raftery, 1993; Buntine, 1991b; Stewart, 1987; Madigan 
& Raftery, 1994). These require the computation of Bayes factors for models con- 
structed during search. Exact methods for computing Bayes factors are considered in 
Section 6.3. 

Derivatives: Various approximation and search algorithms require derivatives be calcu- 
lated, as discussed next. 

6.1 Derivatives 

An important operation on graphs is the calculation of derivatives of parameters. This 
is useful after conditioning on the known data to do approximate inference. Numerical 
optimization using derivatives can be done to search for MAP values of parameters, or 
to apply the Laplace approximation to estimate moments. This section shows how to 
compute derivatives using operations local to each node. The computation is therefore 
easily parallelized, as is popular, for instance, in neural networks. 

Suppose a graph is used to compile a function that searches for the MAP values of 
parameters in the graph conditioned on the known data. In general, this requires use of 
numerical optimization methods (Gill, Murray, & Wright, 1981). To use a gradient descent, 
conjugate gradient or Levenberg-Marquardt approach requires calculation of first deriva- 
tives. To use a Newton- Raphson approach requires calculation of second derivatives, as 
well. While this could be done numerically by difference approximations, more accurate 
calculations exist. Methods for symbolically differentiating networks of functions, and piec- 
ing together the results to produce global derivatives are well understood (Griewank & 
Corliss, 1991). For instance, software is available for taking a function defined in Fortran, 
C-|--|- code, or some other language, to produce a second function that computes the exact 
derivative. These problems are also well understood for feed-forward networks (Werbos, 
McAvoy, & Su, 1992; Buntine & Weigend, 1994), and graphical models with plates only 
add some additional complexity. The basic results are reproduced in this section and some 
simple examples given to highlight special characteristics arising from their use with chain 
graphs. 

Consider the problem of learning a feed-forward network. A simple feed-forward network 
is given in Figure 23(a). The corresponding learning problem is given in Figure 23(b), 
representing the feed-forward network as a Bayesian network. Here the sigmoid units of the 
network are modeled with deterministic nodes, and the network output represents the mean 
of a bivariate Gaussian with inverse variance matrix S. Because of the nonlinear sigmoid 
function making the deterministic mapping from inputs xi,X2, to the means mi, 1112, this 
learning problem has no reasonable component falling in the exponential family. A rough 
fallback method is to calculate a MAP value for the weight parameters. This would be 
the method used for the Laplace approximation (Buntine & Weigend, 1991; MacKay, 1992) 
covered in (Tanner, 1993; Tierney & Kadane, 1986). The setting of priors for feed-forward 
networks is difficult (MacKay, 1993; Nowlan & Hinton, 1992; Wolpert, 1994), and it wiU 
not be considered here other than assuming a prior is used, p(w). The graph implies the 
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(a) 



(b) 




Xi X2 X3 




Figure 23: Learning a feed-forward network 



following posterior probability: 



wi, . . ., W5 I oi,„02,„a;i,„a;2,„a;3,^ : i=l,...,iV) 
^ deti/2s 



(16) 



/ X / X TT det'/^S /I, ,x ^ A 

. . ., W5) 11 — — — exp y^{Oz - m^)'i;(o^ - m,) j 



mi = Sigmoid(v)\h) 
hi = S igmoid{w\,2^) ■ 



The undirected clique on the parameters w indicates the prior has a term p(w). Suppose 
the posterior is differentiated with respect to the parameters W4. The result is well known 
to the neural network community since this kind of calculation yields the standard back- 
propagation equations. 

Rather than work through this calculation, instead look at the general case. To develop 
the general formula for differentiating a graphical model, a few more concepts are needed. 
Deterministic nodes form islands of determinism within the uncertainty represented by the 
graph. Partial derivatives within each island can be calculated via recursive use of the chain 
rule, for instance, by forward or backward propagation of derivatives through the equations. 
For instance, forward propagation for the above network gives: 



dnii 



E 



dmi dhi 
dhi dw4 



This is called forward propagation because the derivatives with respect to W4 are propagated 
forward in the network. In contrast, backward propagation would propagate derivatives of 
nil with respect to different variables backwards. For each island of determinism, the 
important variables are the output variables, and their derivatives are required. So for the 
feed-forward network above, partial derivatives of mi, m2, and with respect to W4 are 
required. 
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Definition 6.1 T/ie non-deterministic children of a nodex, denoted ndchildren(x), are the 
set of non-deterministic variables y such that there exists a directed path from x to y given 
by x,yi, . . . ,yn,y, with all intermediate variables (yi, . . . , y^) being deterministic. The non- 
deterministic parents of a node x, denoted ndparents(x), are the set of non-deterministic 
variables y such that there exists a directed path from y to x given by y,yi, . . . , y^, x, with all 
intermediate variables (yi,. ■ - ^yn) being deterministic. The deterministic children of a node 
X, denoted detchildren(x), are the set of deterministic variables y that are children of x. 
The deterministic parents of a node x, denoted detparents(x), are the set of deterministic 
variables y that are parents of x. 

For instance, in the model in Figure 23, the non-deterministic children of are oi and 02. 
Deterministic nodes can be removed from a graph by rewriting the equations represented 
into the remaining variables of the graph. Because some graphical operations do not apply 
to deterministic nodes, this removal is often done implicitly within a theorem. This goes as 
follows: 

Lemma 6.1 A chain graph G with nodes X has deterministic nodes Y C X . The chain 
graph G' is created by adding to G a directed arc from every node to its non-deterministic 
children, and by deleting the deterministic nodes Y . The graphs G and G' are equivalent 
probability models on the nodes X — Y . 

The general formula for differentiating Bayesian networks with plates and deterministic 
nodes is given below in Lemma 6.2. This is nothing more than the chain rule for differentia- 
tion, but it is important to notice the network structure of the computation. When partial 
derivatives are computed over networks, there are local and global partial derivatives that 
can be different. Consider the feed-forward network of Figure 23 again. On this figure, place 
an extra arc from W4 to m2. Now consider the partial derivative of m2 with respect to W4. 
The value of m2 is infiuenced by W4 directly, as the new arc shows, and indirectly via /i2. 
When computing a partial derivative involving indirect infiuences, we need to differentiate 
between the direct and indirect effects. Various notations are used for this (Werbos et al., 
1992; Buntine & Weigend, 1994). Here the notation of a local versus global derivative is 
used. The local partial derivative is subscripted with an /, d/di and represents the partial 
derivative computed at the node using only the direct infiuences — the parents. For the ex- 
ample of the partial derivative of m2 with respect to W4, the various local partial derivatives 
combine to produce the global partial derivative: 

dm2 dm2 dm2 dh2 



dw4 diW4 dih2 diW4 

This is equivalent to: 

dm2 dm2 dm2 dh2 dm2 dhi 



dw4 diW4 dih2 diW4 dihi diW4 



since |^ = 0. 



In general, the (global) partial derivative for an index variable 6i is the sum of the 
• local partial derivative at the node containing 9i, 
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• the partial derivatives for each child of 9i that is also a non-deterministic child, and 

• combinations of (global) partial derivatives for deterministic children found by back- 
ward or forward propagation of derivatives. 

Lemma 6.2 (Differentiation). A model M is represented by a Bayesian network G with 
plates and deterministic nodes on variables X . Denote the known variables in X by K and 
the unknown variables by U = X — K . Let the conditional probability represented by the 
graph G be p(U\K, M). Let 9 be some unknown variable in the graph, and let Ind(e) ^ ^ 
is non-deterministic and otherwise. If 9 occurs inside a plate then let i be some arbitrary 
valid index (i G indval{9)), otherwise let i be null. Then: 

dlogp(U\I{,M) _ dlogp(9,\parents(9ij) 

d log p(x\parents(x)) 
~^ ' d 9' 

xEndchildren{6i )nchildren{6i ) ^ 



d log p(x\parents(x)) dy 



x^ndchildren[6i) y^detparents[x),y:^6i ^ 

Furthermore, ifYcU is some subset of the unknown variables, then the partial derivative 
of the probability of Y given the known variables, p(Y\I{ , M ) is an expected value of the 
above probabilities: 

dlogp{Y\Ii',M) _ f dlogp{U\Ii',M) \ 

- £u-y\y,k,m[ ) . (18) 

Equation (17) contains only one global partial derivative which is inside the double sum on 
the right side. This is the partial derivative dy/d9i and can be computed from its local island 
of determinism using the chain rule of differentiation, for instance, using forward propaga- 
tion from 9i^s deterministic children, or backward propagation from 9i^s non-deterministic 
parents. 

To apply the Differentiation Lemma on problems like feed-forward networks or unsuper- 
vised learning, the lemma needs to be extended to chain graphs. This means differentiating 
Markov networks as well as Bayesian networks, and handling the expected value in Equa- 
tion (18). These extensions are explained below after first giving two examples. 

As a first example, consider the feed-forward network problem of Figure 23. By treating 
the two output units in the feed-forward network as a single variable, a Cartesian product 
(01,02), the above Differentiation Lemma can now be applied directly to the feed-forward 
network model of Figure 23. This uses the simplification given in Section 2.4 with Com- 
ment 2.1. Let P be the joint probability for the feed-forward network model, given in 
Equation (16). The non-deterministic children of W4 are the single chain component con- 
sisting of the two variables oi and 02. Its parents are the set {mi, 7712}. Consider the 
Differentiation Lemma. There are no children of W4 that are also non-deterministic, so 
the middle sum in Equation (17) of the Differentiation Lemma is empty. Then the lemma 
yields, after expanding out the inner most sum: 

dlogP dlog p(w) 

dw diW4 
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A / 91ogp(oi,„ 02,iiS, mi, m2) dnii ^ 91ogp(oi^„ 02,»ii;, mi, 9m2 \ 
^ V dinii diW4 dim2 81104) 

where p(oi^j', 02,i|S, mi, m2) is the two-dimensional Gaussian, and |^ is from the global 
derivative but evaluates to a local derivative. 

As a second example, reconsider the simple unsupervised learning problem given in the 
introduction to Section 3. The likelihood for a single datum given the model parameters is 
a marginal of the form: 

10 

p(vari = l,var2 = 0,var3 = l\(l),9) = ^ <?!>c 6*1,0 (1 - ^'2,0) ^'s.c • 

c=l 

Taking the logarithm of the full case probability p(class, vari, var2, var^lcj), 9) reveals the 
vectors of components w and t of the exponential distribution: 

log p(c/ass, vari, var2, var^lcj), 9) 
10 3 10 

— ^ ^ f c/ass=c l*^g 0c ~1~ ^ ^ ^ ^ ^lc/ass=c,uarj =triie ^j,c ~1~ ^class=c,varj=false l*^g(l 
c=l i = l c=l 

Notice that the normalizing constant Z((l), 9) is 1 in this case. Consider finding the partial 
derivative 91ogp(f ari, var2, var'i\(j), ^)/9^2,5- This is done for each case when differentiating 
the posterior or the likelihood of the unsupervised learning model. Applying Equation (18) 
to this yields: 

91ogp(f ari, f ar2, var'i\(j), 9) 



d92,5 

^° 9 log 9 2,d 
d=i 



<jV2,b 



, 91og(l-M . n ^ 

<jV2,b 

= -J—'i-var2=true p{class = b\vari, var2, vars, (/), 9) 

^2,5 

+ - — \ — lyar2=falseP{class = 5\vari, var2, vurs, (/), 9) . 

1 - "2,5 

Notice that the derivative is computed by doing first-order inference to find p(class = 
5\vari, var2, var^, cj), 9), as noted by Russell, Binder, and KoUer (1994). This property holds 
in general for exponential family models with missing or unknown variables. Derivatives 
are calculated by some first-order inference followed by a combination with derivatives of 
the w functions. Consider the notation for the exponential family introduced previously in 
Definition 4.1, where the functional form is: 



p(x\y,9,M) = 



exp i^Yyw,{9)t,{x,y)^ . 
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Consider the partial derivative of a marginal of this probability, p(x — u\y, 9, M), for u C x. 
Using Equation (18) in the Differentiation Lemma, the partial derivative becomes: 

diogpjx -u\y,e,M) 'dm{ei dz{e) 

= 2^—Q^^u\.-u,y,e(M^x,y))-—^. (19) 

8 = 1 

If the partition function is not known in closed form (the case with the Boltzmann ma- 
chine) then the final derivative dZ(9)/d9 is approximated (the key formula for doing this 
is Equation (28) in Appendix B). 

To extend the Differentiation Lemma to chain graphs, use the trick illustrated with the 
feed-forward network. First, interpret the chain graph as a Bayesian network on chain com- 
ponents, as done in Equation (7), then apply the Differentiation Lemma. Finally, evaluate 
necessary local partial derivatives with respect to 9i of each individual chain component. 
Since undirected graphs are not necessarily normalized, this may present a problem. In 
general, there is an undirected graph G' on variables X UY . Following Theorem 2.1, the 
general form is: 

....... Uce Cliques{G') 

fc(C) 

Cliques{G' ) 

fc(C) • 

The local partial derivative with respect to x becomes: 

dlogp(X\Y) ^ f ^ dlog fc(C) \ ^^^^ f ^ dlog fc(C) \ 

^'^ \CeCHques(G'),xeC ^'^ / \CeCHques(G'),xeC ^'^ / 

(20) 

The difficulty here is computing the expected value in the formula, which comes from the 
normalizing constant. Indeed, this computation forms the core of the early Boltzmann 
machine algorithm (Hertz et al., 1991). In general, this must be done using something like 
Gibbs sampling and the techniques of Section 7.1 can be applied directly. 

6.2 Decomposing learning problems 

Learning problems can be decomposed into sub-problems in some cases. While the material 
in this section applies generally to these sorts of decompositions, this section considers one 
simple example and then proves some general results on problems decomposition. Prob- 
lem decompositions can also be recomputed on the fiy to create a search through a space of 
models that takes advantage of decompositions that exist. A general result is also presented 
on incremental decomposition. These results are simple applications of known methods for 
testing independence (Frydenberg, 1990; Lauritzen et al., 1990), with some added compli- 
cation because of the use of plates. 

Consider the simple learning problem given in Section 3, Figure 11 over two multinomial 
variables vari and var2, and two Gaussian variables xi and X2. For this problem we have 
specified two alternative models, model Mi and model M2. Model M2 has an additional 
arc going from the discrete variable var2 to the real valued variable xi. We will use this 
subsequently to discuss local search of these models evaluated by their evidence. 

A manipulation of the conditional distribution for this model, making use of Lemma 2.1, 
yields, for model Mi, the conditional distribution given in Figure 24. When parameters. 
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Figure 24: A simplification of model Mi 



01, 02 are a priori independent, and their data likelihoods do not introduce cross terms 
between them, the parameters become a posteriori independent as well. This occurs for 0i, 

02, and the set {/^i^ai}. This model simplification also implies the evidence for model Mi 
decomposes similarly. Denote the sample of the variable xi as si^* = xi^i, . . . , xi^n, and 
likewise for vari and var2. In this case, the result is: 

evidence{Mi) = p{vari^^\Mi) p{var2,*\vari^^, Mi) p{xi^^\vari^^, Mi) p{x2,*\xi^^,vari^^, Mi) . 



The evidence for model M2 is similar except that the posterior distribution of /^i and ai is 
replaced by the posterior distribution for /^'i and a'l. 

This result is general, and applies to Bayesian networks, undirected graphs, and more 
generally to chain graphs. Similar results are covered by Dawid and Lauritzen (1993) 
for a family of models they call hyper-Markov. The general result described above is an 
application of the rules of independence applied to plates. This uses the notion of non- 
deterministic children and parents introduced in Definition 6.1. It also requires a notion of 
local dependence, which is called the Markov blanket, following Pearl (1988), since it is a 
generalization of the equivalent set for Bayesian networks. 

Definition 6.2 We have a chain graph G without plates. The Markov blanket of a node u is 
all neighbors, non-deterministic parents, non-deterministic children, and non-deterministic 
parents of the children and their chain components: 

Markov-blanket(u) = neighbors(u) U ndparents(u) U ndchildren(u) (22) 



From Frydenberg (1990) it follows that u is independent of the other non-deterministic 
variables in the graph G given the Markov blanket. 

To perform the simplification depicted in Figure 24, it is sufficient then to find the finest 
partitioning of the model parameters such that they are independent. The decomposition 
in Figure 24 represents the finest such partition of model Mi. The evidence for the model 
will then factor according to the partition, as given for model Mi in Equation (21). For 
this task there is the following theorem, depicted graphically in Figure 25. 



(21) 



U ndparents(chain-components(ndchildren(u))) . 
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Theorem 6.1 (Decomposition). A model M is represented by a chain graph G with plates. 
Let the variables in the graph be X . There are P possibly empty subsets of the variables X , 
Xi for i = 1, . . . , P such that unknown(Xi) is a partition of unknown{X). This induces a 
decomposition of the graph G into P subgraphs Gi where: 

• the graph Gi contains the nodes Xi and any arcs and plates occurring on these nodes, 
and 

• the potential functions for cliques in Gi are equivalent to those in G. 

The induced decomposition represents the unique finest equivalent independence model to 
the original graph if and only if Xi for i = 1, . . . , P is the finest collection of sets such that, 
when ignoring plates, for every unknown node u in Xi, its Markov blanket is also in Xi. 
This finest decomposition takes 0(|Xp) to compute. Furthermore, the evidence for M now 
becomes a product over each subgraph: 

evidence(M) = p(known(X^)\M) = /o ]^ /8(fcraoi/;ra(Xj'^*)) (23) 

i 

for some functions fi (given in the proof). 

Figure 25 shows how this decomposition works when there are unknown nodes. Fig- 
ure 25(a) shows the basic problem and Figure 25(b) shows the finest decomposition. Notice 
the bottom component cannot be further decomposed because the variable xi is unknown. 



(a) 



(b) 
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Figure 25: The incremental decomposition of a model 



In some cases, the functions fi given in the Decomposition Theorem in Equation (23) 
have a clean interpretation: they are equal to the evidence for the subgraphs. This result 
can be obtained from the following corollary. 
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Corollary 6.1.1 (Local Evidence). In the context of Theorem 6.1, suppose there ex- 
ists a set of chain components Tj from the graph ignoring plates such that Xj = Tj U 
ndparents(Tj), where unknown(ndparents(Tj)) = 0. Then 

fj{known{Xj^^)) = p{known{Tj)^\ndparents{Tj)^,M) . 

If we denote the j'-th subgraph by model ^ then this term is the conditional evidence 
for model given ndparents(Tj)^. Denote by Mq the maximal subgraph on known 
variables only (induced by cliqueso as given in the proof of the Decomposition Theorem). 
If the condition of Corollary 6.1.1 holds for for j = 0, 1, . . . , P, then it follows that the 
evidence for the model M is equal to the product of the evidence for each subgraph: 

P 

evidence(M) = Y['^vidence(Mf ) . (24) 

8 = 

This holds in general if the original graph G is a Bayesian network, as used in learning 
Bayesian networks (Buntine, 1991c; Cooper & Herskovits, 1992). 

Corollary 6.1.2 Equation (24) holds if the parent graph G is a Bayesian network with 
plates. 

In general, we might consider searching through a family of graphical models. To do this 
local search (Johnson, Papdimitriou, & Yannakakis, 1985) or numerical optimization can be 
used to find high posterior models, or Markov chain Monte Carlo methods to select a sample 
of representative models, as discussed in Section 7.2. To do this, how to represent a family 
of models must be shown. Figure 26, for instance, is similar to the models of Figure 11 
except that some arcs are hatched. This is used to indicate that these arcs are optional. 




Figure 26: A family of models (optional arcs hatched) 

To instantiate a hatched arc they can either be removed or replaced with a full arc. This 
graphical model then represents many different models, for all 2"* possible instantiations of 
the arcs. Prior probabilities for these models could be generated using a scheme such as in 
(Buntine, 1991c, p54) or (Heckerman et al., 1994), where a prior probability is assigned by 
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a domain expert for different parts of the model, arcs and parameters, and the prior for a 
full model found by multiplication. The family of models given by Figure 26 includes those 
of Figure 11 as instances. During search or sampling, an important property is the Bayes 
factor for the two models, Bayes- factor(M2, Mi), as described in Section 3.1. Because of 
the Decomposition Theorem and its corollary, the Bayes factor for M2 versus Mi can be 
found by looking at local Bayes factors. The difference between models Mi and M2 is the 
parent for the variable xi, 



That is, the Bayes factor can be computed from only considering the models involving ^ui, ai 
and 

This incremental modification of evidence, Bayes factors, and finest decompositions 
is also general, and follows directly from the independence test. A similar property for 
undirected graphs is given in (Dawid & Lauritzen, 1993). This is developed below for the 
case of directed arcs and non-deterministic variables. Handling deterministic variables will 
require repeated application of these results, because several non-deterministic variables 
may be effected when adding a single arc between deterministic variables. 

Lemma 6.3 (Incremental decomposition). For a graph G in the context of Theorem 6.1, 
we have two non-deterministic variables U and V such that U is given. Consider adding 
or removing a directed arc from U to V . To update the finest decomposition of G, there is 
a unique subgraph containing the unknown variables in ndparents(chain-component(V )). 
To this subgraph add or delete an arc from U to V , and add or delete U to the subgraph if 
required. 

Shaded non-deterministic parents can be added at will to nodes in a graph, and the finest 
decomposition remains unchanged except for a few additional arcs. The use of hatched arcs 
in these contexts causes no additional trouble to the decomposition process. That is, the 
finest decomposition for a graph with plates and hatched directed arcs is formed as if the 
arcs where unhatched directed arcs. The evidence is adjusted during the search by adding 
the different parents as required. 

6.3 Bayes factors for the exponential family 

To make use of the decomposition results in a learning system it is necessary to be able to 
generate Bayes factors or evidence for the component models. For models in the exponential 
family, whose normalization constant is known in closed form, this turns out to be easy. 
If these exact computations are not available, various approximation methods can be used 
to compute the evidence or Bayes factors (Kass & Raftery, 1993); some are discussed in 
Section 7.3. 

If the conjugate distribution for an exponential family model and its derivatives can 
be readily computed, then the Bayes factor for the model can be found in closed form. 
Along with the above decomposition methods, this result is an important basis of many 
fast Bayesian algorithms considering multiple models. It is used explicitly or implicitly in 
all Bayesian methods for learning decision trees, directed graphical models (with discrete 
or Gaussian variables), and linear regression (Buntine, 1991b; Spiegelhalter et al., 1993). 
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For instance, if the normalizing constant Zg{T) in Lemma 4.1 was known in closed form, 
then the Bayes factor can be readily computed. 

Lemma 6.4 Consider the context of Lemma 4.1. Then the model likelihood or evidence, 
given by evidence(M ) = p(xi, . . . , a;jv|yi, • • • , Vn, M), can he computed as: 

evidence{M) = 

Zeir') 



Ze{T)Z^ 



For y\x ~ Gaussian this involves multiplying out the two sets of normalizing constants for 
the Gaussian and Gamma distributions. The evidence for some common exponential family 
distributions is given in Appendix B in Table 5 

For instance, consider the learning problem given in Figure 24. Assume that the variables 
vari and var2 are both binary (0 or 1) and that the parameters 9i and 02 are interpreted 
as follows: 

p(vari = 0|^i) = di , 
p(var2 = 0\vari = 0,02) = 6*2,010, 
p(var2 = 0\vari = 1,02) = 02,o\i ■ 

If we use Dirichlet priors for these parameters, as shown in Table 3, then the priors are: 

(^1,1 — ^1) ~ Dirichlet(Q!i^07 cii,i) 7 
(^2,o|j, 1 - 6*2,01^) ~ Dirichlet (q!2,o|j, a2,i\j) for j = 0, 1 , 

where ^2,o|o is « priori independent of ^2,o|i- The choice of priors for these distributions 
is discussed in (Box & Tiao, 1973; Bernardo & Smith, 1994). Denote the corresponding 
sufficient statistics as nij (equal to the number of data where vari = j) and n2j\i (equal to 
the number of data where var2 = j and vari = 0- Then the first two terms of the evidence 
for model Mi, read directly from Table 5, can be written as: 

Beta{nio + aio,nii + ail) 
p{vari^^\Mi) = — 



p{var2,*\vari^^, M- 



Beta{aifi, Q!i,i) 

Beta{n2^o\o + Q!2,o|o, ^^2,110 + "2,i|o) Beta{n2^o\i + Q!2,o|i, ^^2,111 + "2,i|iy 



Beta{a2^o\o, Q!2,i|o) ^eia(Q!2,o|i, a2,i|i) 

Assume the variables xi and X2 are Gaussian with means given by 

Hi^Q when vari = 0, 

Hi^i when vari = 1, 

/«2|o,i + IJ'2\o,2^i when vari = , 

/"2|i,i + /"2|i,2a^i when vari = 1 

and variances a^j and a2\j respectively. In this case, we split the data set into two parts, 
those when vari = 0, and those when vari = 1- Each get their own parameters, sufficient 
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statistics, and contribution to the evidence. Conjugate priors from Table 3 in Appendix B 
(using y\x Gaussian) are indexed accordingly as: 

^0 i\j 

fj,i\j\ai\j ~ Gaussian(;Uo,8|i7 — i — ) foi" « = 1, 2 and j = 0, 1 , 
a~^J ~ Gamma(^o,8|j/2, f^o,t\j) for « = 1, 2 and j = 0, 1 . 



Notice that Sg^iij is one- dimensional when i = and two-dimensional when i = 2. Suitable 
sufficient statistics for this situation are read from Table 4 by looking at the data summaries 
used there. This can be simplified for xi because d = 1 and yi for the Gaussian is uniformly 
f. Thus the sufficient statistics for xi become the means and variances for the different 
values of fari. Denote x^q and x^i as the sample means of when vari = 0, f, respectively, 
and and their corresponding sample variances. This cannot be done for the second 
case, so we use the notation from Table 4, where T,,JI,(3 from Table 4 become, respectively, 
^2\jjJh\jT P2\j- Change the vector y to (l,xi) when making the calculations indicated here. 
The sufficient statistics are, for each case of vari = j: 

N 

'^2|j = ''^^vari^,=j yiyi i 

8 = 1 

N 

8 = 1 

N 

8 = 1 

The evidence for the last two terms can now be read from Table 5. This becomes: 



J- 



0,1 vr'^^.^/ySo,ib- + ^i,. TiSo,ii,/2)(3l°,]f' 



p(a;2,*Fi,*,wari,*,Mi) = [[ ^ 



('5o,2|j+'ii,j)/2 



^o^vr'^^./^det^/^S^i, T{So,,y/2)^l%]f' 
The final simplification of the model is given in Figure 27. 



7. Approximate methods on graphical models 

Exact algorithms for learning of any reasonable size invariably involve the recursive arc re- 
versal theorem of Section 4.2. Most learning methods, however, use approximate algorithms 
at some level. The most common uses of the exponential family within approximation algo- 
rithms were summed up in Figure 22. Various other methods for inference on plates can be 
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Figure 27: The full simplification of model Mi 



applied either at the model level or the parameter level: Gibbs sampling, first described in 
Section 7.1, other more general Markov chain Monte Carlo algorithms, EM style algorithms 
(Dempster, Laird, & Rubin, 1977), and various closed form approximations such as the 
mean field approximation, and the Laplace approximation (Berger, 1985; Azevedo-Filho & 
Shachter, 1994). This section summarizes the main families of these approximate methods. 

7.1 Gibbs sampling 

Gibbs sampling is the basic tool of simulation and can be applied to most probability 
distributions (Geman & Geman, 1984; Gilks et al., 1993a; Ripley, 1987) as long as the full 
joint has no zeros (all variable instantiations are possible). It is a special case of the general 
Markov chain Monte Carlo methods for approximate inference (Ripley, 1987; Neal, 1993). 
Gibbs sampling can be applied to virtually any graphical model whether there are plates, 
undirected or directed arcs, and whether the variables are real or discrete. Gibbs sampling 
does not apply to graphs with deterministic nodes, however, since these put zeroes in the 
full joint. This section describes Gibbs sampling without plates, as a precursor to discussing 
Gibbs sampling with plates in Section 7.2. On challenging problems, other forms of Markov 
chain Monte Carlo sampling can and should be tried. The literature is extensive. 

Gibbs sampling corresponds to a probabilistic version of gradient ascent, although their 
goals of averaging as opposed to maximizing are fundamentally different. Gradient ascent in 
real valued problems corresponds to simple methods from function optimization (Gill et al., 
1981) and in discrete problems corresponds to local repair or local search (Johnson et al., 
1985; Minton, Johnson, Philips, & Laird, 1990; Selman, Levesque, & Mitchell, 1992). Gibbs 
sampling varies gradient ascent by introducing a random component. The algorithm usually 
tries to ascend, but will sometimes descend, as a strategy for exploring further around the 
search space. So the algorithm tends to wander around local maxima with occasional 
excursions to other regions of the space. Gibbs sampling is also the core algorithm of 
simulated annealing if temperature is held equal to one (van Laarhoven & Aarts, 1987). 

To sample a set of variables X according to some non-zero distribution p(X), initialize X 
to some value and then repeatedly resample each variable x £ X according to its conditional 
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probability p(x\X — {x}). For the simple medical problem of Figure 2, suppose the value 
of symptoms is known, and the remaining variables are to be sampled, then do as follows: 

1. Initialize the remaining variables somehow. 

2. Repeat the following for i = 1, 2, 3, . . ., and record the sample of Agci, Occi, Clinii, Disi 
at the end of each cycle. 

(a) Reassign Age by sampling it according to the conditional: 

p(Age\Occ,Clim, Dis, Symp) . 

That is, take the values of Occ,Clim, Dis, Symp as given and compute the re- 
sulting conditional distribution on Age. Then sample Age according to that 
distribution. 

(b) Reassign Occ by sampling it according to the conditional: 

p(Occ\Age,Clim, Dis, Symp) . 

(c) Reassign Clim by sampling it according to the conditional: 

p(Clim\Age,Occ, Dis, Symp) . 

(d) Reassign Dis by sampling it according to the conditional: 

p(Dis\Age, Clim, Occ, Synip) . 
This sequence of steps is depicted in Figure 28. In this figure, the basic graph has been re- 




Figure 28: Gibbs sampling on the medical example 



arranged for each step to represent the dependencies that arise during the sampling process. 
This uses the arc reversal and conditioning operators introduced previously. 

The effect of sampling is not immediate. Age2, Occ2, Clim2, Dis2 is conditionally depen- 
dent on Agci, Occi, Clinii, Disi, and in general so is Agci, Occi, Clinii, Disi for any i. How- 
ever the effect of the sampling scheme is that in the long run, for large i, Agci, Occi, Clinii, Disi 
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is approximately generated according to p(Age,Occ,Clim,Dis\Symp) independently of 
Agei,Occi,Climi, Disi. In Gibbs sampling, all the conditional sampling is done in accor- 
dance with the original distribution, and since this is a stationary process, in the long run 
the samples converge to the stationary distribution or fixed-point of the process. Methods 
for making subsequent samples independent are known as regenerative simulation (Ripley, 
1987) and correspond to sending the temperature back to zero occasionally. 

With this sample different quantities such as the probability a patient will have Age > 20 
and Clim = tropical given Symp can be estimated. This is done by looking at the frequency 
of this event in the generated sample. The justification for this is the subject of Markov 
process theory ((^inlar, 1975, Theorem 2.26). The following result, presented informally, 
applies: 

Comment 7.1 Let xi,X2, ■ ■ ■ ,xi be a sequence of discrete variables from a Gibbs sampler 
for the distribution p(x) > 0. Then the average of g(xi) approaches the expected value with 
probability 1 as I approaches infinity: 

1 ' 



I 



^g{xi) — > g{x) = E{g{x)) 



-1 

Further, for a second function h{xi), the ratio of two sample averages for g and h approaches 

their "true" ratio: 

Ef=i 9{xi) ^ g^ 

This is used to approximate conditional expected values. 

To complete this procedure it is necessary to know how many Gibbs samples to take, how 
large to make /, and how to estimate the error in the estimate. Both these questions have 
no easy answer but heuristic strategies exist (Ripley, 1987; Neal, 1993). 

For Bayesian networks this scheme is easy in general since the only requirement when 
sampling from p(x\X — {x}) is the conditional distribution for nodes connected to x, and 
the global probabilities do not need to be calculated. Notice, for instance, that in Figure 28 
some sampling operations do not require all five variables. The general form for Bayesian 
networks given in Equation (2) goes as follows: 

p(x\X-{x}) - - Pix\parentsix)) Uy xeparent.s{y) Piy\parentsiy)) 

J2xP{^) J2xP(x\parents(xj) JJy . ^^^^^^^.(j^) p(y|parerais(y)) ' 

Notice the product is over a subset of variables. Only include conditional distributions for 
variables that have a; as a parent. Thus, the formula only involves examining the parents, 
children and children's parents of a;, the so-called Markov blanket (Pearl, 1988). Also, notice 
normalization is only required over the single dimension changed in the current cycle, done 
in the denominator. For x discrete, these conditional probabilities can be enumerated and 
direct sampling done for x. 

The kind of simplification above for Bayesian networks also applies to undirected graphs 
and chain graphs, with or without plates. Here, modify Equation (4): 



p(x\X - {X}) = ^ „ • (25) 

Es^exp n : xeCecHques{G) Jc[<^) 
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In this formula, ignore all cliques not containing x so, again, Gibbs sampling only computes 
with information local to the node. Also, the troublesome normalization constant does not 
have to be computed because the probability is a ratio of functions and so cancels out. As 
before, normalization is only required over the single dimension x. 

7.2 Gibbs sampling on plates 

Many learning problems can be represented as Bayesian networks. For instance, the simple 
unsupervised learning problem represented in Figure 10 is a Bayesian network once the 
plate is expanded out. It follows that Gibbs sampling is readily applied to learning as a 
general inference algorithm (Gilks et al., 1993a, 1993b). 

Consider a simplified example of this unsupervised learning problem. In this model, 
assume that each variable vari and var2 belongs to a mixture of Gaussians of known 
variance equal to 1.0. This simple model is given in Figure 29. For a given class, class = c, 

-€) 

Figure 29: Unsupervised learning in two dimensions 

the variables vari and var2 are distributed as Gaussian with means /^i^c and /U2,c- In the 
uniform, unit-variance case the distribution for each sample is given by: 

p(vari,var2\,(l),l^,M) = ^(])cN {vari - {var2 - li2,c) , 

c 

where A(,) is the one- dimensional Gaussian probability density function with standard 
deviation of 1. This model might seem trivial, but if the standard deviation were to vary 
as well, the model corresponds to a Kernel density estimate so can approximate any other 
distribution arbitrarily well using sufficient number of tiny Gaussians. 

In this simplified Gaussian mixture model, the sequence of steps for Gibbs sampling 
goes as follows: 

1. Initialize the variables <?!>c, Mi^c, M2,c for each class c. 

2. Repeat the following and record the sample of (/^c I^i,ct l^2,c for each class c at the end 
of each cycle. 

(a) For i = 1, . . . , A, reassign classi according to the conditional: 

p{class, I vari^t,var2,t,(t), 1^1, 1^2) ■ 

(b) Reassign the vector cj) by sampling according to the conditional: 

p((l)\classi : i=l,...,A). 
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(c) Reassign the vector /^i (and 122) by sampling according to the conditional: 

p(pi\vari^i, classi : i=l,...,N). 
Figure 30(a) illustrates Step 2(a) in the language of graphs. Figure 30(b) illustrates 



Steps 2(b) and 2(c). Step 2(a) represents the standard sampling operation using infer- 
ence on Bayesian networks without plates. Step 2(b) and 2(c) are also easy to perform 
because in this case the distributions are exponential family, and the graph matches the 
conditions for Lemma 6.1.2. Therefore, each of the model parameters cj), 1^1, 1^2 are a posteri- 
ori independent and their distribution is known in closed form, with the sufficient statistics 
calculated in 0(N) time. 

One important caveat in the use of Gibbs sampling for learning is the problem of sym- 
metry. In the above description, there is nothing to distinguish class 1 from class 2. Initially, 
the class centers for the above process will remain distinct. Asymptotically, since there 
is nothing in the problem definition to distinguish between class 1 and class 2, they will 
appear indistinguishable. This problem is handled by symmetry breaking: force /^i^c < M2,c- 

Gibbs sampling applies whenever there are variables associated with the data that are 
not given. Hidden or latent variables are an example. Incomplete data (or missing values) 
(Quinlan, 1989), robust methods and modeling of outliers, and various density estimation 
and non-parametric methods all fall in this family of models (Titterington et al., 1985). 
Gibbs sampling generalizes to virtually any graphical model with plates and unshaded 
nodes inside the plate; the sequence of sampling operations will be much the same as in 
Figure 30. If the underlying distribution is exponential family, for instance. Lemma 5.1 
applies after shading all nodes inside the plate; each full cycle is guaranteed to be linear 
time in the sample size. The algorithm in the exponential family case is summed up in 
Figure 31. Figure 31(1) shows the general learning problem, extending the mixture model 
of Figure 22. This same structure appears in Figure 29. However, in Figure 31(1) the 
sufficient statistics T{x^,u^) are also shown. Figure 31(11) shows more of the algorithm, 
generalizing Figure 30(a) and (b). Again the role of sufficient statistics is shown. The 
sampling in Figure 30(b) first computes the sufficient statistics and then sampling applies 
from that. 

Thomas, Spiegelhalter, and Gilks (1992)(Gilks et al., 1993b) have taken advantage of 
this general applicability of sampling to create a compiler that converts a graphical represen- 
tation of a data analysis problem, with plates, into a matching Gibbs sampling algorithm. 
This scheme applies to a broad variety of data analysis problems. 



(a) 




Figure 30: Gibbs sampling in the unsupervised learning problem 
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(I) 



(ll)(a) 




N 



Exp. Family 



(ll)(b) 



Exp. Family 




Figure 31: Gibbs sampling with the exponential family 



7.3 A closed form approximation 

What would happen to Gibbs sampling if the number of cases in the training sample, N , 
was large compared to the number of unknown parameters in the problem? A good way to 
think of this is: first, if N is sufficiently large, then the samples of the model parameters 
(j) and , will tend to drift around their mean because their posterior variance would 
be 0(1 /N). That is, after the i-th step, the sample is (j^i, l^i^i, l^2,i- The i + 1-th sample 
//i^i+i, //2,8+i would be conditionally dependent on these, but because N is large, the 
posterior variance of the i + 1-th sample given the i-th sample would be small so that: 



This approximation is used with Markov chain Monte Carlo methods by the mean field 
method from statistical physics, popular in neural networks (Hertz et al., 1991). Rather 
than sampling a sequence of parameters 9, 01,02, ■■■ ,0i, according to some scheme, use the 
deterministic update: 



where the expected value is according to the sampling distribution. This instead generates 
a deterministic sequence 0i,02, ■ ■ ■ ,0i that under reasonable conditions converges to some 
maxima. This kind of approach leads naturally to the EM algorithm, which will be discussed 
in Section 7.4. 

7.4 The expectation maximization (EM) algorithm 

The expectation maximization algorithm, widely known as the EM algorithm, corresponds 
to a deterministic version of Gibbs sampling used to search for the MAP estimate for model 
parameters (Dempster et al., 1977). It is generally considered to be faster than gradient 
descent. Convergence is slow near a local maxima so some implementations switch to con- 
jugate gradient or other methods (Meilijson, 1989) when near a solution. The computation 
used to find the derivative is similar to the computation used for the EM algorithm, so this 
does not require a great deal of additional code. Also, the determinism means the EM algo- 
rithm no longer generates unbiased posterior estimates of model parameters. The intended 




(26) 
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gain is speed, not accuracy. The EM algorithm can generally be applied to exponential 
family models wherever Gibbs sampling can be applied. The correspondence between EM 
and Gibbs is shown below. 

Consider again the simple unsupervised learning problem represented in Figure 29 (Sec- 
tion 7.2). In this case, the sequence of steps for the EM algorithm is similar to that for 
the Gibbs sampler. The EM algorithm works on the means or modes of unknown variables 
instead of sampling them. Rather than sampling the set of classes and thereby comput- 
ing sufficient statistics so that a distribution for (j) and 1^1,1^2 can be found, a sequence of 
class means are generated and thereby used to compute expected sufficient statistics. Like- 
wise, instead of sampling new parameters cj) and 121,122, modes are then computed from the 
expected sufficient statistics. 

Consider again the unsupervised learning problem in Figure 9. Suppose there are 10 
classes and that the three variables vari,var2,var3 are finite valued and discrete and mod- 
eled with a multinomial with probabilities conditional on the class value classi. The suf- 
ficient statistics in this case are all counts: Uj is the number of cases where the class is j; 
''^v,k\j is the number of cases where class = j and var^ = k: 

N 

= ^cla.s.St=j 1 

8 = 1 

N 

'^v,k\j — ^ ^ lcfass;=jl-»ar^,; = fc • 

8 = 1 

The expected sufficient statistics computed from the rules of probability for a given set of 
parameters cj) and ^1,^2,^3 are given by: 

N 

n~ = '^p(class, = j\vari^„var2,i, vars^^cj), 01, 92,93) , 
8=1 

N 

'VHi = ^P{(^^o.ss, = j\varl^„var2,^,var3,^„(t>,9l,92,93)lyar,,,=k ■ 
8=1 

Thanks to Lemma 4.2, these kinds of expected sufficient statistics can be computed for 
most exponential family distributions. Once sufficient statistics are computed for any of 
the distributions posterior means or modes of the model parameters (in this case (j) and 
^17^27^3) can be found. 

1. Initialize the parameters (j) and ^1,^27^3- 

2. Repeat the following until some convergence criteria is met: 

(a) Compute the expected sufficient statistics raj and n^^^j. 

(b) Recompute (j) and 9i,92, 9^ to be equal to their mode conditioned on the sufficient 
statistics. For many posterior distributions, these can be found in standard 
tables, and in most cases found via Lemma 4.2. For instance, using the mean for 
(j) gives: 

^ ^ nj+aj 
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All of the other Gibbs sampling algorithms discussed in Section 7.2 can be similarly 
placed in this EM framework. When the mode is used in Step 2(b), ignoring numerical 
problems, the EM algorithm converges on a local maxima of the posterior distribution for 
the parameters (Dempster et al., 1977). The general method is summarized in the following 
comment (Dempster et al., 1977). 

Comment 7.2 The conditions of Lemma 5.1 apply with data variables X inside the plate 
and model parameters 9 outside. In addition, some of the variables U C X are latent, so 
they are unknown and unshaded. Some of the remaining variables are sometimes missing, 
so, for the data Xi, variables Vi C (X — U) are not given. This means the data given for 
the i-th datum is X — U — Vi for i = 1, . . . N . The EM algorithm goes as follows: 

E-step: The contribution to the expected sufficient statistics for each datum is: 



The expected sufficient statistic is then ET = J2i=i ETi. 

M-step: Maximize the conjugate posterior using the expected sufficient statistics ET in 
place of the sufficient statistics using the MAP approach for this distribution. 

The fixed point of this algorithm is a local maxima of the posterior for 9. 

Figure 31(11) for Gibbs sampling in Section 7.2 illustrates the steps in EM nicely. Fig- 
ure 31(II)(a) corresponds to the E-step. The expected sufficient statistics are found from 
the parameters 9. Rather than sampling m, and therefore computing the sufficient statistics, 
the expected sufficient statistics are computed. Figure 31(II)(b) corresponds to the M-step. 
Here, given the sufficient statistics, the mean or mode of the parameters 9 are computed 
instead of being sampled. EM is therefore Gibbs with a mean/mode approximation done 
at the two major sampling steps of the algorithm. 

In some cases, the expected sufficient statistics can be computed in closed form. Assume 
the exponential family distribution for p{X\9) has a known normalization constant Z{9) and 
the link function exists. For some i, the normalizing constant for the exponential family 
distribution p{Ui, Vi\X — Ui — Vi, 9) is known in closed form. Denote it by Zi(9). Then using 
the notation of Theorem 4.1: 



8. Partial exponential models 

In some cases, only an initial, inner part of a learning problem can be handled using the 
recursive arc reversal theorem of Section 4.3. In this case, simplify what can be simplified, 
and then solve the remainder of the problem using a generic method like the MAP ap- 
proximation. This section presents several examples: linear regression with heterogeneous 
variance, feed-forward networks with a linear output layer, and Bayesian networks. 

This general process was depicted in the graphical model for the partial exponential 
family of Figure 22. This is an abstraction used to represent the general process. Consider 



ETt - £u,,v,\x-u,-v,,0{t{Xt)) ■ 
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the problem of learning a Bayesian network, both structure and parameters, where the 
distribution is exponential family given the network structure. The variable T is a discrete 
variable indicating which graphical structure is chosen for the Bayesian network. The 
variable X represents the full set of variables given for the problem. The variable ©x 
represents the distributional parameters for the Bayesian network, and is the part of the 
model that is conveniently exponential family. That is, p(X|0x,r) will be treated as 
exponential for different ©x and hold T fixed. The sufficient statistics in this case are 
given by ss(X*,T). In this case, the subproblem that conveniently falls in the exponential 
family, p(0x|X*,r) is simplified, but it is necessary to resort to the more general learning 
techniques of previous sections to solve the remaining part of the problem, p(T\X*). 

8.1 Linear regression with heterogeneous variance 

Consider the heterogeneous variance problem given in Figure 32. This shows a graphical 
model for the linear regression problem of Section 4.4 modified to the situation where the 
standard deviation is heterogeneous, so it is a function of the inputs x as well. In this case. 



The exponential transformation guarantees that the standard deviation s will also be posi- 
tive. 

The corresponding learning model can be simplified to the graph in Figure 33. Compare 
this with the model given in Figure 21. What is the difference? In this case, the sufficient 
statistics exist, but they are shown to be deterministically dependent on the sample and ulti- 
mately on the unknown parameters for the standard deviation weights-a. If the parameters 
for the standard deviation were known then the graph could be reduced to Figure 21. Com- 
putationally, this is an important gain. It says that for a given set of values for weights-a, 
calculation can be done that is linear time in the sample size to arrive at a characterization 




Figure 32: Linear regression with heterogeneous variance 



the standard deviation s is not given but is computed via: 
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Figure 33: The heterogeneous variance problem with the plate simplified 

of what the parameters from the mean, weights-/^, should be. In short, one half of a prob- 
lem, p{weights- ii\weights-a , yi, x^^i : i = 1, . . . , iV), is well understood. To do a search for 
the MAP solution, search the space of parameters for the standard deviation, weights-a, 
since the remaining {weights- ji) is given. 

Another variation of linear regression replaces the Gaussian error function with a more 
robust error function such as Student's t distribution, or an Lq norm for 1 < g < 2. By 
introducing a convolution, these robust regression models can be handled by combining the 
EM algorithm with standard least squares (Lange & Sinsheimer, 1993). 

8.2 Feed-forward networks with a linear output layer 

A similar example is the standard feed-forward network where the final output layer is linear. 
This situation is given by Figure 23 if we change the deterministic functions for mi and m2 
to be linear instead of sigmoidal. In this case Lemma 5.1 identifies that when the weight 
vectors W3,W4, and are assumed given, the distribution is in the exponential family. 
Thus the simplification to Figure 34 is possible using the standard sufficient statistics for 
multivariate linear regression. Algorithmically, this implies that given values for the internal 
weight vectors W3,W4, and W5, and assuming a conjugate prior holds for the output weight 
vectors wi and W2, the posterior distribution for the output weight vectors wi and W2, and 
their means and variances, can be found in closed form. The evidence for wi and W2 given 
iU3,i/;4 and 1/75, 1/73, 1/74, 1/75, M), can also be computed using the exact 

method of Lemma 6.4, so therefore the posterior for U73, U74, and U75, 

p(-M73,-M74,-M75|y*,a;i,*, . . .,a;„,*,M) OC p(-M73,-M74,-M75|M)p(y*|a;i,*, . . .,a;„,*,-M73,-M74,-M75,M) , 

can be computed in closed form up to a constant. This effectively cuts the problem into 
two pieces — U73,U74 and followed by wi and W2 given U73,U74, and — and provides a 
clean solution to the second piece. 
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Figure 34: Simplified learning of a feed-forward network with linear output 
8.3 Bayesian networks with missing variables 

Class probability trees and discrete Bayesian networks can be learned efficiently by notic- 
ing that their basic form is exponential family (Buntine, 1991b, 1991a, 1991c; Cooper & 
Herskovits, 1992; Spiegelhalter et al., 1993). Take, for instance, the family of models spec- 
ified by the Bayesian network given in Figure 26. In this case, the local evidence corollary. 
Corollary 6.1.1, applies. The evidence for Bayesian networks generated from this graph 
is therefore a product over the nodes in the Bayesian network. If we change a Bayesian 
network by adding or removing an arc, the Bayes factor is therefore simply the local Bayes 
factor for the node, as mentioned in the incremental decomposition lemma. Lemma 6.3. 
Local search is then quite fast, and Gibbs sampling over the space of Bayesian networks is 
possible. A similar situation exists with trees (Buntine, 1991b). The same results apply to 
any Bayesian network with exponential family distributions at each node, such as Gaussian 
or Poisson. Results for Gaussians are presented, for instance, in (Geiger & Heckerman, 
1994). 

This local search approach is a MAP approach because it searches for the network 
structure maximizing posterior probability. More accurate approximation can be done by 
generating a Markov chain of Bayesian networks from the search space of Bayesian networks. 
Because the Bayes factors are readily computed in this case, Gibbs sampling or Markov chain 
Monte Carlo schemes can be used. The scheme given below is the Metropolis algorithm 
(Ripley, 1987). This only looks at single neighbors until a successor is found. This is done 
be repeating the following steps: 

1. For the initial Bayesian network G, randomly select a neighboring Bayesian network 
G' differing only by an arc. 
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2. Compute B ayes- factor {G' ,G) by making the decompositions described in Theo- 
rem 6.1, doing a local computation as described in Lemma 6.3, and using the Bayes 
factors computed with Lemma 6.4. 

3. Accept the new Bayesian network G' with probability given by: 



If accepted, assign G' to G, otherwise G remains unchanged. 

A local maxima Bayesian network could be found concurrently, however, this scheme gen- 
erates a set of Bayesian networks appropriate for model averaging and expert evaluation 
of the space of potential Bayesian networks. Of course, initialization might search for lo- 
cal maxima to use as a reference. This sampling scheme was illustrated in the context of 
averaging in Figure 12. 

This scheme is readily adapted to learn the structure and parameters of a Bayesian 
network with missing or latent variables. For the Metropolis algorithm, add Step 4, which 
resamples the missing data and latent variables. 

4. For the current complete data and Bayesian network G, compute the predictive dis- 
tribution for the missing data or latent variables. Use this to resample the missing 
data or latent variables to construct a new set of complete data (for subsequent use 
in computing Bayes factors). 

9. Conclusion 

The marriage of learning and graphical models presented here provides a framework for 
understanding learning. It also provides a framework for developing a learning or data 
analysis toolkit, or more ambitiously, a software generator for learning algorithms. Such a 
toolkit combines two important components: a language for representing a learning problem 
together with techniques for generating a matching algorithm. While a working toolbox is 
not demonstrated, a blueprint is provided to show how it could be constructed, and the 
construction of some well-known learning algorithms has been demonstrated. Table 1 lists 
some standard problems, the derivation of algorithms using the operations from the previous 
chapters, and where in the text they are considered. The notion of a learning toolkit is not 
new, and can be seen in the BUGS system by Thomas, Spiegelhalter, and Gilks (1992)(Gilks 
et al., 1993b), in the work of Cohen (1992) for inductive logic programming, and emerging 
in software for handling generalized linear models (McCuUagh & Nelder, 1989; Becker, 
Chambers, k Wilks, 1988). 

There is an important role for a data analysis toolkit. Every problem has its own quirks 
and requirements. Knowledge discovery, for instance, can vary in many ways depending 
on the user-defined notion of interestingness. Learning is often an embedded task in a 
larger system. So while there are some easy applications of learning, generally learning 
applications require special purpose development of learning systems or related support 
software. Sometimes, this can be achieved by patching together some existing techniques or 
by decomposing a problem into subproblems. Nevertheless, the decomposition and patching 



min I 1^ Bayes- f actor{G\G) 
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Problem 


Method 


Sections 


Bayesian networks and expo- 
nential family conditionals 


Decomposition of exact Bayes factors, 
with local search or Gibbs sampling to 
generate alternative models 


6.2, 6.3, 8.3 


Bayesian networks with miss- 
ing/latent variables, and other 
unsupervised learning models 


/~1 '11 1 • ^~^ "i\ J" 1 • c 

Gibbs sampling or FiVL making use oi 
above techniques where possible 


7.2, 7.4, 8.3 


Feed-forward networks 


MAP method by exact computation of 
derivatives 


6.1 


Feed-forward networks with 
linear output 


As above with an initial removal of the 
linear component 


8.2 


Linear regression and 
extensions 


Least squares, EM, and MAP 


4.4, 8.1 


Generalized linear models 


MAP method by exact computation of 
derivatives 


2.3, 6.1 



Table 1: Derivation of learning algorithms 



of learning algorithms with inference and decision making can be formalized and understood 
within graphical models. In some ways the S system plays the role of a toolkit (Chambers 
& Hastie, 1992). It provides a system for prototyping learning algorithms, includes the 
ability to handle generalized linear models, does automatic differentiation of expressions, 
and includes many statistical and mathematical functions useful as primitives. The language 
of graphical models is best viewed as an additional layer on top of this kind of system. Note, 
also, that it is impractical to assume that a software generator could create algorithms 
competitive with current finely tuned algorithms, for instance, for hidden Markov models. 
However, a software toolkit for learning could be used to prototype an algorithm that could 
later be refined by hand. 

The combination of learning and graphical models shares some of the superior aspects 
of each of the different learning fields. Consider the philosophy of neural networks. These 
nonparametric systems are composed of simple computational components, usually readily 
parallelizable, and often nonlinear. The components can be pieced together to tailor systems 
for specific applications. Graphical models for learning have these same features. Graphical 
models also have the expressibility of probabilistic knowledge representations that were 
developed in artificial intelligence to be used in knowledge acquisition contexts. They 
therefore form an important basis of knowledge refinement. Finally, graphical models for 
learning allow the powerful tools of statistics to be applied to the problem. 

Once learning problems are specified in the common language of graphical models, their 
associated learning algorithms, their derivation, and their interrelationships can be explored. 
This allows commonalities between seemingly diverse pairs of algorithms — such as k-means 
clustering versus approximate methods for learning hidden Markov models, learning decision 
trees versus learning Bayesian networks (Buntine, 1991a), and Gibbs sampling versus the 
expectation maximization algorithm in Section 7.4 — to be understood as variations of one 
another. The framework is important as an educational tool. 
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Appendix A. Proofs of Lemmas and Theorems 
A.l Proof of Theorem 2.1 

A useful property of independence is that A is independent of B given C if and only if 
p(A, B, C) = f(A, C)g(B, C) for some functions / and g. The only if result follows directly 
from this property. The proof of the if result makes use of the following simple lemma. If 
A is independent of B given X — A — B , and p(X) = Yli fii^i) for some functions fi > 
and variable sets Xi C X , then: 

p(X) = l[gdX, - B) h,{X, - A) (27) 

i 

for some functions gi, hi > 0. Notice that it is known p{X) = g(X — B) h(X — A) for some 
functions g and h by independence. Instantiate the variables in B to some value b. Then: 

g(X-B)h(X-A,B = b) = l[MX,,B = b). 

i 

Similarly, instantiate A to a, then: 

g(X-B,A = a)h(X-A) = l[MX,,A = a). 

i 

Multiplying both sides of the two equalities together, and substitute in 

g(X - B,A = a)h(X - A,B = b) = p(X,A = a,B = b) = Y[MX,, A = a, B = b) . 

i 

Get: 

p{X) n MX,, A = a,B = b) = n MX,, B = b)l[ MX,, A = a) . 

i i i 

This is defined for all X since the domain is a cross product. The lemma holds because all 
functions are strictly positive if 

_ MX.,B = b) 
' ^ ~ MX..B = b,A = a) 

and 

h,{X,-A) = MX^,A = a) . 

The final proof of the if result follows by applying Equation (27) repeatedly. Suppose 
the variables in X are xi, . . ., Xy. Now p(X) = foiX) for some strictly positive function /q. 
Therefore: 

p(X) = go(X — {xi})ho({xi} U neighbors(x)) . 

Denote Ai^o = X — {xi} and Ai^i = {xi} U neighbor s{xi). Repeating the application of 
Equation (27) for each variable yields: 

K^) = n ••• n 9n,...,Ax- u ^^-.^i) • 

«'l=0,l ?'^=0,l j = l,...,u 

for strictly positive functions gi^^,,,^i^. Now, consider these functions. It is only necessary to 
keep the function gi^^,,,^i^ if the set X — Uj=i,...,tj maximal: it is not contained in any 

such set. Equivalently, keep the function gi^^,,,^i^ if the set Uj=i v^hi] minimal. The 
minimal such sets are the set of cliques on the undirected graph. The result follows. 
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A. 2 Proof of Theorem 6.1 

It takes 0(|Xp) operations to remove the deterministic nodes from a graph using Lemma 6.1. 
These nodes can be removed from the graph and then reinserted at the end. Hereafter, as- 
sume the graph contains no deterministic nodes. Also, denote the unknown variables in a 
set Y by unknown{Y) = Y — known(Y ). Then, without loss of generality, assume that Xi 
contains all known variables in the Markov blanket of unknown(Xi). 

Showing the independence model is equivalent amounts to showing for i = 1, . . . , P that 
unknown(Xi) is independent of Uj^^i unknown(Xj) given known(X ). To test independence 
using the method of Frydenberg (1990), each plate must be expanded (that is, duplicate it 
the right number of times), moralize the graph, removing the given nodes, and then test 
for separability. The Markov blanket for each node in this expanded graph corresponds 
to those nodes directly connected in the moralized expanded graph. Suppose we have the 
finest unique partition unknown(Xi) of the unknown nodes. Aj's are then reconstructed 
by adding known variables in the Markov blankets for variables in unknown(Xi). Suppose 
V is an unknown variable in a plate, and Vj are its instances once the plate is expanded. 
Now, by symmetry, every Vj is either in the same element of the finest partition, or they 
are all in separate elements. If Vj has a certain unknown variable in its Markov boundary 
outside the plate, then so must Vk for k j hj symmetry. Therefore Vj and Vk are in the 
same element of the partition. Hence by contradiction, if Vj is in a separate element, that 
element occurs wholly within the plate boundaries. Therefore, this finest partition can be 
represented using plates, and the finest partition identified from the graph ignoring plates. 
The operation of finding the finest separated sets in a graph is quadratic in the size of the 
graph, hence the 0(|Ap) complexity. 

Assume the condition holds and consider Equation (23). Let cliques(T) denote the 
subsets of variables in r U parents(T) that form cliques in the graph formed by restricting 
Gr to r U parents(T) and placing an undirected arc between all parents. Let t(X) be the 
set of chain components in X . From Frydenberg (1990), we have: 

p(x\M) = n n n aciQ) . 

t^t[X) i^ind^r) C^cliques^r) 

Furthermore, if m G Xi is not known, then the variables in m's Markov blanket will occur in 
Xi, and therefore, if m G C for some clique C, then C C Xi. Therefore cliques containing 
an unknown variable can be partitioned according to which subgraph they belong in. Let: 

cliques'j = {C : C G cliques(T(X )), unknown(Xj) fl C 7^ 0} , 

and add to this any remaining cliques wholly contained in the set so far: 

cliquesj = cliques'j U < C : C G cliques(T(X)), C C C' 

I C (zcliques'^ 

Call any remaining cliques cliquesQ. Therefore: 

p{x\M) = n n n . 

j=0 C'EdiqueSj i£ind{C) 
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Distribution 


Functional form 


j ~ C-dim multinomial(^i , ... ,9c) 
y\x ~ Gaussian(a;"'"^, (t) 
X ~ Gamma(Q! > 0,/3 > 0) 
6 ~ C-dini Dirichlet(Q!i , . . . , cic) 

X ~ rf-dim Gaussian(// G S G ?R.'^'^'^) 

S ~ rf-dim Wishart(a > rf, S G 9?'^^'^) 


forjG[l,...,C] 

Tfc exp (-^(y - a;t6')2) for y eU,x e 

r(a) 

Beta{o.x,...,o.c) ^^8 = 1 « 

<Mlj|^ exp (-\{x - //)tS(a; - for x G ^i'^ 

expf itraceS-15) 

2da/2^d(d-i)/4 r((a+i-j)/2) 2 ; 

for 5, S symmetric positive definite 



Table 2: Distributions and their functional form 



Results in: 

Furthermore, the potential functions on the cliques in Gi are well defined as described. 
A. 3 Proof of Corollary 6.1.1 

If X,j = Tj U ndparents(Tj), then every clique in a chain component in Tj will occur in 
cliques j. Therefore: 

n n 9c(Ci) = Yi p(T\ndparents(T)) , 

Cediquesj ieind{C) tETj 

= p(Tj\ndparents(Tj)) . 

A. 4 Proof of Lemma 6.3 

Consider the definition of the Markov blanket. If a directed arc is added between the nodes, 
then the Markov blanket will only change for an unknown node X if U now enters the set 
of non-deterministic parents of the chain-components containing non-deterministic children 
of X . This will not effect the subsequent graph separability, however, because it will only 
subsequently add arcs between U , a given node, and other nodes. 

Appendix B. The exponential family 

The exponential family of distributions was described in Definition 4.1. The common use of 
the exponential family exists because of Theorem 4.1. Table 2 gives a few exponential family 
distributions and their functional form. Further details and more extensive tables can be 
found in most Bayesian textbooks on probability distributions (DeGroot, 1970; Bernardo 
& Smith, 1994). Table 3 gives some standard conjugate prior distributions for those in 
Table 2, and Table 4 gives their matching posteriors (DeGroot, 1970; Bernardo & Smith, 
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Distribution 


Conjugate prior 


j ~ C-dim multinomial 
y\x ~ Gaussian 
X ~ Gamma 
X ~ (i-dim Gaussian 


9 ~ Diriclilet(Q!i , . . . , ac) 

6\a ~ (i-dim Gaussian(^07 ^Sq), ~ Gamma(Q!o/2, 2//3o) 
/3 a ~ Gamma(Q!o, /3o) 

S ~ Gaussian(;Uo7 ^oS), S ~ Wisliart(^07 5*0) 



Table 3: Distributions and their conjugate priors 



Distribution 


Conjugate posterior 


j ~ C-dim multinomial 


9 ~ Diriclilet(rai + cii, . . . , rac + ac) 

for = Ejll lj,=c = # < j's = c> 


y\x ~ Gaussian 


6\a ~ (i-dim Gaussian(^, ^S), (T~^ ~ Gamma((Q!o + iV)2, 2//3), 
for S = So + Eili y.yj, ^ = (So^o + Eili a^.y.) , 
for /3 = Eili(a;. - 0^y^f + (0- ^o)^So(^ - ^o) + /3o 


X ~ Gamma 


/3 a ~ Gamma(iVQ! + ciq, Ei^i ^« + Po) 


X ~ (i-dim Gaussian 


~ Gaussian(7I,(iV + iVo)S), S ~ Wisliart(iV + ^o, 5* + 5*0) 
for -p = X + j^^i^o x), 
for 5 = J2tLiixt - x)ix, - xy 



Table 4: Distributions and matching conjugate posteriors 
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Distribution 



Evidence 



j ~ C-dim multinomial 
y\x ~ Gaussian 

X ~ Gamma 

X ~ (i-dim Gaussian 



Beta(ni + cii, . . . , rac + ac)/ Beta(ai , . . . , ac) 

deti/2 Sn r((«o+JV)/2)/3("o + J^)/" 



7r^/2 detl/2S 



r(«o/2)/3;«/' 



/3"°r(Jv«+«o) 



r(«o)(E^,^,+/3o) 



iVa-|- ag 



for a fixed 



(^)dJV/2 det('^o + ^)/^(5+5o) (JV+JVq)' 



d r((§o+JV-l-»)/2) 



r((5o-i-0/2) 



Table 5: Distributions and their evidence 



1994). For the distributions in Table 2 with priors in Table 3 , Table 5 gives their matching 
evidence derived using Lemma 6.4 and cancelling a few common terms. 

In the case where the functions Wi are full rank'm 9 (dimension of 9 is k, same as w, and 

the Jacobian of v) with respect to 9 is invertible, det ^ ^ then various moments 

of the distribution can be easily found: 



^x\y,e {t{x,y)) 



Aw{9)Y^ AZ{9) 



A9 



A9 



m) 



The vector function w{9) now has an inverse and it is referred to as the link function 
(McCuUagh & Nelder, 1989). This yields: 



£x\y,e exp \ ^(t),t,{x,y) 



\i=l 



Z{w-H<i) + w{9))) 

m 



These are important because if the normalization constant Z can be found in closed form, 
then it can be differentiated and divided, for instance, symbolically, to construct formula for 
various moments of the distribution such as 8^\g {ti{x)) and 8^\g {ti{x)tj{x)). Furthermore, 
Equation (28) implies that derivatives of the normalization constant, dZ(^)/d^, can be 
found by estimating moments of the sufficient statistics (for instance, by Markov chain 
Monte Carlo methods). 
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